Abstract

Competitive on-line prediction (also known as universal prediction of 
individual sequences) is a strand of learning theory avoiding making any 
stochastic assumptions about the way the observations are generated. The 
predictor's goal is to compete with a benchmark class of prediction rules, 
which is often a proper Banach function space. Metric entropy provides 
a unifying framework for competitive on-line prediction: the numerous 
known upper bounds on the metric entropy of various compact sets in 
function spaces readily imply bounds on the performance of on-line pre- 
diction strategies. This paper discusses strengths and limitations of the 
direct approach to competitive on-line prediction via metric entropy, in- 
cluding comparisons to other approaches. 

1 Introduction 

A typical result of competitive on-line prediction says that, for a given bench- 
mark class of prediction strategies, there is a prediction strategy that performs 
almost as well as the best prediction strategies in the benchmark class. For 
simplicity, in this paper the performance of a prediction strategy will be mea- 
sured by the cumulative squared distance between its predictions and the true 
observations, assumed to be real (occasionally complex) numbers. Different 
methods of competitive on-line predictions (such as Gradient Descent, following 
the perturbed leader, strong and weak aggregating algorithms, defensive fore- 
casting, etc.) tend to have their narrow "area of expertise": each works well for 
benchmark classes of a specific "size" but is not readily applicable to classes of 
a different size. 

In this paper we will apply a simple general method based on metric entropy 
to benchmark classes of a wide range of sizes. Typically, this method does not 
give optimal results, but its results are often not much worse than those given by 
specialized methods, especially for benchmark classes that are not too massive. 
Since the method is almost universally applicable, it sheds new light on the 
known results. 



Another disadvantage of the metric entropy method is that it is not clear 
how to implement it efficiently, whereas many other methods are computation- 
ally very efficient. Therefore, the results obtained by this method are only a 
first step, and we should be looking for other prediction strategies, both com- 
putationally more efficient and having better performance guarantees. 

We start, in §2, by stating a simple asymptotic result about the existence
of a universal prediction strategy for the class of continuous prediction rules. 
The performance of the universal strategy is in the long run as good as the 
performance of any continuous prediction rule, but we do not attempt to esti- 
mate the rate at which the former approaches the latter. This is the topic of 
the following section, where we establish general results about performance 
guarantees based on metric entropy. For example, in the simplest case where
the benchmark class $\mathcal{F}$ is a compact set, the performance guarantees become
weaker as the metric entropy of $\mathcal{F}$ becomes larger.

The core of the paper is organized according to the types of metric compacts
pointed out by Kolmogorov and Tikhomirov in [27] (§3). Type I compacts have
metric entropy of order $\log\frac{1}{\epsilon}$; this case corresponds to the
finite-dimensional benchmark classes and is treated in §4. Type II, with the
typical order $\log^M\frac{1}{\epsilon}$, contains various classes of analytic
functions and is dealt with in §5. The key §6 deals with perhaps the most
important case of order $(1/\epsilon)^{\gamma}$; this includes, e.g., Besov
classes. The classes of type IV, considered in §7, have metric entropy that
grows even faster.

In §§4-7 the benchmark class is always given. In §9 we ask the question of
how prediction strategies competitive against various benchmark classes
compare to each other. The previous section, §8, prepares the ground for this.
The concluding section, §10, lists several directions of further research.

There is no real novelty in this paper; I just apply known results about 
metric entropy to competitive on-line prediction. I hope it will be useful as a 
survey. 

2 Simple asymptotic result 

Throughout the paper we will be interested in the following prediction protocol
(or its modifications):

On-line regression protocol
FOR n = 1, 2, ...:
  Reality announces $x_n \in X$.
  Predictor announces $\mu_n \in \mathbb{R}$.
  Reality announces $y_n \in [-Y, Y]$.
END FOR.

At the beginning of each round n Predictor is given some signal $x_n$ that might
be helpful in predicting the following observation $y_n$, after which he announces
his prediction $\mu_n$. The signal is taken from the signal space X, the observations
are real numbers known to belong to a fixed interval $[-Y, Y]$, $Y > 0$, and the
predictions are any real numbers (later this will also be extended to complex
numbers). The error of prediction is always measured by the quadratic loss
function, so the loss suffered by Predictor on round n is $(y_n - \mu_n)^2$. It is clear
that it never makes sense for Predictor to choose predictions outside $[-Y, Y]$,
but the freedom to go outside $[-Y, Y]$ might be useful when the benchmark
class is not closed under truncation.
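
The protocol can be simulated directly. The following is a minimal Python sketch, not part of the original development: the function name `play_protocol`, the representation of Reality's moves as a list of (signal, observation) pairs, and the predictor interface are assumptions made for illustration. It also illustrates, on made-up data, that truncating predictions to $[-Y, Y]$ does not increase the quadratic loss.

```python
from typing import Callable, List, Sequence, Tuple

History = List[Tuple[float, float]]


def clip(value: float, bound: float) -> float:
    """Truncate value to the interval [-bound, bound]."""
    return max(-bound, min(bound, value))


def play_protocol(
    rounds: Sequence[Tuple[float, float]],
    predictor: Callable[[float, History], float],
    Y: float = 1.0,
) -> float:
    """Run the on-line regression protocol and return Predictor's cumulative
    quadratic loss. `rounds` lists Reality's moves (x_n, y_n)."""
    history: History = []
    total_loss = 0.0
    for x, y in rounds:
        if abs(y) > Y:
            raise ValueError("observation outside [-Y, Y]")
        mu = predictor(x, history)      # Predictor moves before seeing y_n
        total_loss += (y - mu) ** 2     # quadratic loss on this round
        history.append((x, y))          # y_n is revealed only now
    return total_loss


# Hypothetical data and strategies, for illustration only.
rounds = [(0.0, 1.0), (1.0, -1.0), (2.0, 0.5)]
wild = lambda x, history: 3.0                        # predicts outside [-1, 1]
truncated = lambda x, history: clip(wild(x, history), 1.0)
loss_wild = play_protocol(rounds, wild)
loss_truncated = play_protocol(rounds, truncated)
```

On this toy sequence the truncated strategy suffers a smaller cumulative loss than the untruncated one, as the remark about predictions outside $[-Y, Y]$ suggests.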

Remark Competitive on-line prediction uses a wide range of loss functions
$\lambda(y_n, \mu_n)$. The quadratic loss function $\lambda(y_n, \mu_n) := (y_n - \mu_n)^2$ belongs to the
class of "mixable" loss functions, which are strictly convex in the prediction $\mu_n$
in a fairly strong sense. Such loss functions allow the strongest performance
guarantees (using, e.g., the "aggregating algorithm" of [42], which we will call
the strong aggregating algorithm). If the loss function is convex but not strictly
convex in the prediction, the performance guarantees somewhat weaken (and can
be obtained using, e.g., the weak aggregating algorithm of [22]; for a review of
earlier methods, see [12]). When the loss function is not convex in the prediction,
it can be "convexified" by using randomization ([12], Chapter 4).

A prediction rule is a function $F : X \to \mathbb{R}$. Intuitively, F plays the role of
the strategy for Predictor that recommends prediction $F(x_n)$ after observing
signal $x_n \in X$; such strategies are called Markov prediction strategies in [38].
We will be interested in benchmark classes consisting of only Markov prediction 
strategies; in practice, this is not as serious a restriction as it might appear: 
it is usually up to us what we want to include in the signal space X, and we 
can always extend X by including, e.g., some of the previous observations and 
signals. 

Our first result states the existence of a strategy for Predictor that asymptot- 
ically dominates every continuous prediction rule (for much stronger asymptotic 
results, see gniEZIEHlllSD- 
Theorem 1 Let X be a metric compact. There exists a strategy for Predictor
that guarantees

$$\limsup_{N\to\infty} \left( \frac{1}{N}\sum_{n=1}^{N} (y_n - \mu_n)^2 - \frac{1}{N}\sum_{n=1}^{N} (y_n - F(x_n))^2 \right) \le 0 \tag{1}$$

for each continuous prediction rule F.

Any strategy for Predictor that guarantees (1) for each continuous $F : X \to \mathbb{R}$
will be said to be universal (or, more fully, universal for C(X)). Theorem 1,
asserting the existence of universal prediction strategies, will be proved at the
end of this section.

Aggregating algorithm 

This subsection will introduce the main technical tool used in this paper, an 
aggregating algorithm (in fact intermediate between the strong aggregating
algorithm of [42] and the weak aggregating algorithm of [22]). For future use in
§5 we will allow the observations to belong to the Euclidean space $\mathbb{R}^m$ (in fact,
we will only be interested in the cases m = 1 and m = 2). Correspondingly, we
allow predictions in $\mathbb{R}^m$ and extend the notion of prediction rule allowing values
in $\mathbb{R}^m$.

The $\ell_2$ norm in $\mathbb{R}^m$ will be denoted $\|\cdot\|_2$ or simply $\|\cdot\|$; in later sections
we will also use $\ell_p$ norms $\|\cdot\|_p$ for $p \ne 2$. Reality is constrained to producing
observations in the ball $YU_m$ in $\mathbb{R}^m$ with radius Y and centred at 0; $U_V$ is our
general notation for the closed unit ball $\{v \in V \mid \|v\| \le 1\}$ centred at 0 in a
Banach space V, and we abbreviate $U_m := U_{\mathbb{R}^m}$.

Lemma 1 Let $F_1, F_2, \ldots$ be a sequence of $\mathbb{R}^m$-valued prediction rules assigned
positive weights $w_1, w_2, \ldots$ summing to 1. There is a strategy for Predictor
producing $\mu_n \in YU_m$ that are guaranteed to satisfy, for all N = 1, 2, ... and all
i = 1, 2, ...,

$$\sum_{n=1}^{N} \|y_n - \mu_n\|^2 \le \sum_{n=1}^{N} \|y_n - F_i(x_n)\|^2 + 8Y^2 \ln\frac{1}{w_i}. \tag{2}$$

For m = 1 the constant $8Y^2$ in (2) can be improved to $2Y^2$ (cf. [42], Lemma 2
and the line above Remark 3), and it is likely that this is also true in general. In
this paper we, however, do not care about multiplicative constants (and usually
do not even give them explicitly in the statements of our results; the reader can
always extract them from the proofs).

Inequality (2) says that Predictor's total loss does not exceed the total loss
suffered by an alternative prediction strategy plus a regret term ($8Y^2 \ln\frac{1}{w_i}$ in
the case of (2)); we will encounter many such inequalities in the rest of this
paper.

Proof of Lemma 1 Let $\eta := \frac{1}{8Y^2}$, $\beta := e^{-\eta}$, and $P_0$ be the probability
measure on {1, 2, ...} assigning weight $w_i$ to each i = 1, 2, .... Lemma 1 and Remark
3 of [42] imply that it suffices to show that the function $\beta^{\|y-\mu\|^2}$ is concave in
$\mu \in YU_m$ for each fixed $y \in YU_m$ (this idea goes back to Kivinen and Warmuth
[25]). Furthermore, it suffices to show that the function

$$\beta^{\|a+bt\|^2} = e^{-\eta\left(\|a\|^2 + 2\langle a,b\rangle t + \|b\|^2 t^2\right)}$$

is concave in $t \in [0,1]$ for any a and b such that a and a + b belong to the ball
$2YU_m$ of radius 2Y centred at 0. Taking the second derivative, we can see that
we need to show

$$2\eta\left(\langle a,b\rangle + \|b\|^2 t\right)^2 \le \|b\|^2.$$

By the convexity of the function $(\cdot)^2$, it suffices to establish the last inequality
for t = 0,

$$2\eta\langle a,b\rangle^2 \le \|b\|^2, \tag{3}$$

and t = 1,

$$2\eta\left(\langle a,b\rangle + \|b\|^2\right)^2 \le \|b\|^2. \tag{4}$$

Inequality (3) follows, for $\eta \le \frac{1}{8Y^2}$, from

$$2\eta\langle a,b\rangle^2 \le 2\eta\|a\|^2\|b\|^2 \le 8\eta Y^2\|b\|^2 \le \|b\|^2.$$

In the case of (4), it is clear that we can replace a by its projection onto the
direction of b and so assume $a = \lambda b$ for some $\lambda \in \mathbb{R}$. Therefore, (4) becomes

$$2\eta(\lambda+1)^2\|b\|^4 \le \|b\|^2,$$

which holds because $(\lambda+1)^2\|b\|^2 = \|a+b\|^2 \le 4Y^2$ and $8\eta Y^2 \le 1$.

The proof of Lemma 1 exhibits an explicit strategy for Predictor guaranteeing
(2); we will refer to this strategy as the AA mixture of $F_1, F_2, \ldots$ (with
weights $w_1, w_2, \ldots$).
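
To make the AA mixture concrete, here is a minimal Python sketch for the real-valued case m = 1 and finitely many prediction rules (both restrictions, and the names `aa_mixture` and `cumulative_loss`, are assumptions of this example). It relies on the concavity used in the proof of Lemma 1: when $\beta^{(y-\mu)^2}$ is concave in $\mu$, predicting the weighted average of the truncated recommendations, with weights proportional to $w_i\beta^{L_i}$ where $L_i$ is the loss of $F_i$ so far, is enough for (2). This is one strategy consistent with the proof, not necessarily the exact substitution rule of [42].

```python
import math
from typing import Callable, List, Sequence


def aa_mixture(
    rules: Sequence[Callable[[float], float]],
    weights: Sequence[float],
    signals: Sequence[float],
    observations: Sequence[float],
    Y: float = 1.0,
) -> List[float]:
    """Predictions of the weighted-average mixture with eta = 1 / (8 Y^2)."""
    eta = 1.0 / (8.0 * Y * Y)                      # learning rate from the proof
    log_weights = [math.log(w) for w in weights]   # log of w_i * beta**(loss of F_i)
    predictions = []
    for x, y in zip(signals, observations):
        # Truncating a rule to [-Y, Y] can only decrease its loss.
        advice = [max(-Y, min(Y, rule(x))) for rule in rules]
        shift = max(log_weights)                   # avoids underflow in exp
        posterior = [math.exp(lw - shift) for lw in log_weights]
        mu = sum(p * a for p, a in zip(posterior, advice)) / sum(posterior)
        predictions.append(mu)
        # y is used only after the prediction has been made.
        log_weights = [lw - eta * (y - a) ** 2
                       for lw, a in zip(log_weights, advice)]
    return predictions


def cumulative_loss(predictions, observations):
    return sum((y - p) ** 2 for p, y in zip(predictions, observations))


# Hypothetical benchmark rules and data, for illustration only.
rules = [lambda x: 0.0, lambda x: x, lambda x: -x, lambda x: 1.0]
weights = [0.25, 0.25, 0.25, 0.25]
signals = [math.sin(n) for n in range(1, 201)]
observations = [0.8 * x + 0.1 * math.cos(3 * n)
                for n, x in enumerate(signals, start=1)]
mixture_loss = cumulative_loss(
    aa_mixture(rules, weights, signals, observations), observations)
```

With four equally weighted rules and Y = 1, inequality (2) predicts that `mixture_loss` exceeds the loss of each rule by at most $8\ln 4$.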

Proof of Theorem 1

Theorem 1 follows immediately from the separability of the function space C(X)
of continuous real-valued functions on X (JHli Corollary 4.2.18). Indeed, we
can choose a dense sequence $F_1, F_2, \ldots$ of prediction rules in C(X) and take any
positive weights $w_i$ summing to 1. Let F be any continuous prediction rule;
without loss of generality, $F : X \to [-Y, Y]$. For any $\epsilon > 0$, the AA mixture
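clipped to $[-Y, Y]$ will satisfy

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F_i(x_n))^2 + 8Y^2\ln\frac{1}{w_i} \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + (4Y + \epsilon)\epsilon N + 8Y^2\ln\frac{1}{w_i},$$

where i is such that $F_i$ is $\epsilon$-close to F in C(X). Since this holds for any $\epsilon > 0$,
the proof is complete.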

3 Performance guarantees based on metric entropy: general results

In the rest of the paper we will be assuming, without loss of generality, that
Y = 1. To recover the case of a general Y > 0, the universal constants C in
Theorems 2-4 (and their corollaries) below should be replaced by $CY^2$ and all
the norms should be divided by Y. The constants C in those theorems are not
too large (of order 10 according to the proofs given, but no effort has been made
to optimize them).

We will consider three types of non-asymptotic versions of Theorem 1,
corresponding to Theorems 2-4 of this section. In the first type the benchmark class
$\mathcal{F}$ is a metric compact, and we can guarantee that Predictor's loss over the first
N observations does not exceed the loss suffered by the best prediction rule in
the benchmark class plus a regret term of o(N), the rate of growth of the regret
term depending on the metric entropy of $\mathcal{F}$. In the second type $\mathcal{F}$ is a Banach
function space on X whose unit ball $U_{\mathcal{F}}$ is a compact (in metric C(X)) subset
of C(X). In this case it is impossible to have the same performance guarantees;
Predictor will need a start (given in terms of their norm in the Banach space)
on remote prediction rules. Results of this type can be easily obtained from
results of the first type. In the third type the benchmark class consists of all
continuous prediction rules; such results can be obtained from results of the
second type for "universal" Banach spaces, i.e., Banach spaces that are dense
subsets of C(X).

Compact benchmark classes 

Let A be a compact metric space. The metric entropy $\mathcal{H}_\epsilon(A)$, $\epsilon > 0$, is defined to
be the binary logarithm log N of the minimum number of elements $F_1, \ldots, F_N \in A$
that form an $\epsilon$-net for A (in the sense that for each $F \in A$ there exists
i = 1, ..., N such that F and $F_i$ are $\epsilon$-close in A). The requirement of compactness
of A ensures that $\mathcal{H}_\epsilon(A)$ is finite for each $\epsilon > 0$.
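
For a finite family of functions tabulated on a grid, an $\epsilon$-net in the uniform metric can be built greedily, which gives an upper bound on the metric entropy. The Python sketch below is only an illustration under these assumptions (a finite A, functions stored as tuples of values; the names `sup_distance`, `greedy_net` and `entropy_upper_bound` are hypothetical). The greedy net is generally not minimal, so the binary logarithm of its size bounds $\mathcal{H}_\epsilon(A)$ from above rather than computing it.

```python
import math
from typing import List, Sequence, Tuple

Function = Tuple[float, ...]   # values of a function on a fixed finite grid


def sup_distance(f: Function, g: Function) -> float:
    """Uniform distance between two tabulated functions."""
    return max(abs(a - b) for a, b in zip(f, g))


def greedy_net(family: Sequence[Function], eps: float) -> List[Function]:
    """Return elements of `family` that form an eps-net for `family`."""
    centres: List[Function] = []
    for f in family:
        if all(sup_distance(f, c) > eps for c in centres):
            centres.append(f)       # f is not yet covered, so make it a centre
    return centres


def entropy_upper_bound(family: Sequence[Function], eps: float) -> float:
    """Binary logarithm of the size of the greedy eps-net."""
    return math.log2(len(greedy_net(family, eps)))


# Hypothetical example: the constant functions 0, 1, ..., 100 on a one-point grid.
family = [(float(c),) for c in range(101)]
net = greedy_net(family, eps=10.0)
```

In this toy example the greedy net has 10 elements (the constants 0, 11, ..., 99), whereas 5 centres (10, 31, 52, 73, 94) already suffice, so the bound $\log 10$ overshoots the true value $\log 5$ by one bit.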

Remark There are four main variations on the notion of metric entropy as
defined in [27]; our definition corresponds to Kolmogorov and Tikhomirov's
relative $\epsilon$-entropy $\mathcal{H}^A_\epsilon(A)$. In general, relative $\epsilon$-entropy $\mathcal{H}^R_\epsilon(A)$ can be defined
for any metric space R containing A as a subspace (in our applications we would
take R := C(X)). The other two variations are the absolute $\epsilon$-entropy $\mathcal{H}^{\mathrm{abs}}_\epsilon(A)$
(denoted simply $\mathcal{H}_\epsilon(A)$ by Kolmogorov and Tikhomirov; it was introduced by
Pontryagin and Shnirel'man [31] in 1932, without taking the binary logarithm
and using $\epsilon$ in place of Kolmogorov and Tikhomirov's $2\epsilon$) and the $\epsilon$-capacity
$\mathcal{E}_\epsilon(A)$. All four notions were studied by Kolmogorov, his students (Vitushkin,
Erokhin, Tikhomirov, Arnol'd), and Babenko in the 1950s, and their results are
summarized in [27]. It is always true that, in our notation,

$$\mathcal{E}_{2\epsilon}(A) \le \mathcal{H}^{\mathrm{abs}}_\epsilon(A) \le \mathcal{H}^R_\epsilon(A) \le \mathcal{H}_\epsilon(A) \le \mathcal{E}_\epsilon(A) \tag{5}$$

([27], Theorem IV). All results in can be applied to all elements of the
chain (5), and in principle we can use any of the four notions; our choice of
$\mathcal{H}_\epsilon(A) = \mathcal{H}^A_\epsilon(A)$ is closest to the notion of entropy numbers popular in the
recent literature (such as [10]).

Theorem 2 Suppose $\mathcal{F}$ is a compact set in C(X). There exists a strategy for
Predictor that produces $\mu_n$ with $|\mu_n| \le 1$ and guarantees, for all N = 1, 2, ...
and all $F \in \mathcal{F}$,

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C \inf_{\epsilon\in(0,1/2]} \left( \mathcal{H}_\epsilon(\mathcal{F}) + \log\log\frac{1}{\epsilon} + \epsilon N + 1 \right), \tag{6}$$

where C is a universal constant.



Proof Without loss of generality we can only consider $\epsilon$ of the form $2^{-i}$,
i = 1, 2, ..., in (6). Let us fix, for each i, a $2^{-i}$-net $\mathcal{F}_i$ for $\mathcal{F}$ of size
$2^{\mathcal{H}_{2^{-i}}(\mathcal{F})}$; to each element of $\mathcal{F}_i$ we assign weight
$\frac{6}{\pi^2} i^{-2} 2^{-\mathcal{H}_{2^{-i}}(\mathcal{F})}$, so that the weights sum to 1.
Our goal (6) will be achieved if we establish, for each i = 1, 2, ...,

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C\left( \mathcal{H}_{2^{-i}}(\mathcal{F}) + \log i + 2^{-i}N + 1 \right) \tag{7}$$

(we let C stand for different constants in different formulas).

Without loss of generality it will be assumed that F and all functions in $\mathcal{F}_i$,
i = 1, 2, ..., take values in $[-1, 1]$. Fix an i. Let $F^* \in \mathcal{F}_i$ be $2^{-i}$-close to F in
C(X). Lemma 1 gives a prediction strategy satisfying

$$\begin{aligned}
\sum_{n=1}^{N} (y_n - \mu_n)^2 &\le \sum_{n=1}^{N} (y_n - F^*(x_n))^2 + 8\ln\left(\frac{\pi^2}{6}\, i^2\, 2^{\mathcal{H}_{2^{-i}}(\mathcal{F})}\right) \\
&\le \sum_{n=1}^{N} (y_n - F(x_n))^2 + 8\ln\left(\frac{\pi^2}{6}\, i^2\, 2^{\mathcal{H}_{2^{-i}}(\mathcal{F})}\right) + 4\left(2^{-i}N\right) \\
&\le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C\left(1 + \log i + \mathcal{H}_{2^{-i}}(\mathcal{F}) + 2^{-i}N\right),
\end{aligned}$$

which coincides with (7). ∎

Banach function spaces as benchmark classes 

Let $\mathcal{F}$ be a linear subspace of C(X) equipped with a norm making it into a
Banach space. We will be interested in the case where $\mathcal{F}$ is compactly embedded
into C(X), in the sense that the unit ball

$$U_{\mathcal{F}} := \{F \in \mathcal{F} \mid \|F\|_{\mathcal{F}} \le 1\}$$

is a compact subset of C(X). (The Arzela-Ascoli theorem, ^7], 2.4.7, shows
that all such $\mathcal{F}$ are Banach function spaces with finite embedding constant,
as defined in [45] and below; in particular, they are proper Banach functional
spaces.)

Theorem 3 Let $\mathcal{F}$ be a Banach space compactly embedded in C(X). There
exists a strategy for Predictor that produces $\mu_n$ with $|\mu_n| \le 1$ and guarantees,
for all N = 1, 2, ... and all $F \in \mathcal{F}$,

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C \inf_{\epsilon\in(0,1/2]} \left( \mathcal{H}_{\epsilon/\phi}(U_{\mathcal{F}}) + \log\log\frac{1}{\epsilon} + \log\log\phi + \epsilon N + 1 \right), \tag{8}$$

where C is a universal constant and

$$\phi := 2\max(1, \|F\|_{\mathcal{F}}). \tag{9}$$

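Proof Notice that $\mathcal{H}_\epsilon(2^i U_{\mathcal{F}}) = \mathcal{H}_{2^{-i}\epsilon}(U_{\mathcal{F}})$, i = 1, 2, .... Applying (6) to
$\mathcal{F} := 2^i U_{\mathcal{F}}$, we obtain

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C\left( \mathcal{H}_{2^{-i}\epsilon}(U_{\mathcal{F}}) + \log\log\frac{1}{\epsilon} + \epsilon N + 1 \right) \tag{10}$$

for any $\epsilon \in (0, 1/2]$; we will assign weight $\frac{6}{\pi^2} i^{-2}$ to the corresponding prediction
strategy. AA mixing the prediction strategies achieving (10) for i = 1, 2, ...
(it is clear that Lemma 1 is applicable to any prediction strategies, not only
prediction rules), we obtain a strategy achieving

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C\left( \mathcal{H}_{2^{-i}\epsilon}(U_{\mathcal{F}}) + \log\log\frac{1}{\epsilon} + \epsilon N + 1 \right) + 8\ln\left(\frac{\pi^2}{6}\, i^2\right) \tag{11}$$

for all i = 1, 2, ... and all $F \in 2^i U_{\mathcal{F}}$. For each $F \in \mathcal{F}$ we can set

$$i := \max(1, \lceil \log\|F\|_{\mathcal{F}} \rceil)$$

to obtain $2^i \le \phi \le 2^{i+1}$ and so, from (11),

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C\left( \mathcal{H}_{\epsilon/\phi}(U_{\mathcal{F}}) + \log\log\frac{1}{\epsilon} + \epsilon N + 1 \right) + 8\ln\left(\frac{\pi^2}{6}\log^2\phi\right).$$

The last inequality can be written as (8).

Competing with the continuous prediction rules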

Let $\mathcal{F} \subseteq C(X)$ be a Banach function space (no connection between the norms
in $\mathcal{F}$ and C(X) is assumed) which is dense in C(X) (in the C(X) metric, of
course); in this case we will say that $\mathcal{F}$ is densely embedded in C(X). The
approachability of $F \in C(X)$ by $\mathcal{F}$ is defined as the function

$$\mathcal{A}^{\mathcal{F}}_\epsilon(F) := \inf\left\{ \|F^*\|_{\mathcal{F}} \mid \|F - F^*\|_{C(X)} \le \epsilon \right\}, \quad \epsilon > 0, \tag{12}$$

which is finite under our assumption of density.

Remark The Gagliardo set of a function $F \in C(X)$ can be defined as

$$\Gamma(F) := \left\{ (t_0, t_1) \in \mathbb{R}^2 \mid \exists F_0 \in C(X), F_1 \in \mathcal{F} : F_0 + F_1 = F,\ \|F_0\|_{C(X)} \le t_0,\ \|F_1\|_{\mathcal{F}} \le t_1 \right\}. \tag{13}$$

(See H, §3.1, for the general definition.) The graph of the function $\epsilon \mapsto \mathcal{A}^{\mathcal{F}}_\epsilon(F)$
is essentially the boundary of $\Gamma(F)$. A third way of talking about the Gagliardo
set is in terms of the norm

$$K(t, F) := \inf_{F_0 + F_1 = F} \left( \|F_0\|_{C(X)} + t\,\|F_1\|_{\mathcal{F}} \right), \tag{14}$$

where t ranges over the positive numbers. (See HJ, §3.1, or [2], 7.8, for further
details.)

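Theorem 4 Let $\mathcal{F}$ be a Banach function space compactly and densely embedded
in C(X). Theorem 3's strategy guarantees, for all N = 1, 2, ... and $F \in C(X)$,

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C \inf_{\epsilon\in(0,1/2]} \left( \mathcal{H}_{\epsilon/A(\epsilon)}(U_{\mathcal{F}}) + \log\log\frac{1}{\epsilon} + \log\log A(\epsilon) + \epsilon N + 1 \right), \tag{15}$$

where C is a universal constant and $A(\epsilon) := 2\max(1, \mathcal{A}^{\mathcal{F}}_\epsilon(F))$.

Proof Inequality (8) immediately implies

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C \inf_{\delta>0}\ \inf_{\epsilon\in(0,1/2]} \left( \mathcal{H}_{\epsilon/A(\delta)}(U_{\mathcal{F}}) + \log\log\frac{1}{\epsilon} + \log\log A(\delta) + \epsilon N + 4\delta N + 1 \right),$$

and it remains to restrict $\delta$ to $\delta \in (0, 1/2]$ and set $\epsilon := \delta$. ∎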

Theorem 4 will be the source of many universal prediction strategies. Given
any of the Banach spaces compactly and densely embedded in C(X) introduced
in later sections, Theorem 4 produces a universal prediction strategy: it is clear
that (15) implies (1).

4 Finite-dimensional benchmark classes

We will be using (following [22]) the notation $f \sim g$ to mean $\lim_{\epsilon\to 0} (f(\epsilon)/g(\epsilon)) = 1$
and the notation $f \asymp g$ to mean $f = O(g)$ and $g = O(f)$ as $\epsilon \to 0$, where f
and g are positive functions of $\epsilon > 0$.

If the benchmark class is finite-dimensional, the typical rate of growth of
its metric entropy is

$$\mathcal{H}_\epsilon(\mathcal{F}) \sim L\log\frac{1}{\epsilon}, \tag{16}$$

where L is the "metric dimension" of $\mathcal{F}$. This motivates the following corollaries
of Theorems 2 and 3, respectively.

Corollary 1 Suppose $\mathcal{F}$ is a compact set in C(X) such that

$$L := \limsup_{\epsilon\to 0} \frac{\mathcal{H}_\epsilon(\mathcal{F})}{\log\frac{1}{\epsilon}} \in (0, \infty). \tag{17}$$

There exists a strategy for Predictor that guarantees, for all $F \in \mathcal{F}$,

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + CL\log N \tag{18}$$

from some N on, where C is a universal constant.

Proof It suffices to set $\epsilon := 1/N$ in (6). (And it is easy to check that this value
of $\epsilon$ extracts from (6) an optimal, to within a constant factor, regret term.) ∎
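
The remark about the choice $\epsilon := 1/N$ can be checked numerically. The Python sketch below evaluates the bracket in (6) for a hypothetical entropy function $\mathcal{H}_\epsilon = L\log\frac{1}{\epsilon}$ and compares $\epsilon = 1/N$ with the best $\epsilon$ of the form $2^{-i}$. The exact equality for the entropy, the values of L and N, and the function names are assumptions for illustration, and the universal constant C is omitted.

```python
import math
from typing import Callable


def bracket(entropy: Callable[[float], float], eps: float, N: int) -> float:
    """The expression H_eps + log log (1/eps) + eps * N + 1 from (6),
    with binary logarithms and 0 < eps <= 1/2."""
    return entropy(eps) + math.log2(math.log2(1.0 / eps)) + eps * N + 1.0


def best_dyadic(entropy: Callable[[float], float], N: int, max_i: int = 60) -> float:
    """Minimum of the bracket over eps = 2**-i, i = 1, ..., max_i."""
    return min(bracket(entropy, 2.0 ** -i, N) for i in range(1, max_i + 1))


# Hypothetical type I entropy with metric dimension L = 3.
L, N = 3.0, 1024
entropy = lambda eps: L * math.log2(1.0 / eps)
at_one_over_n = bracket(entropy, 1.0 / N, N)   # the choice made in the proof
at_best = best_dyadic(entropy, N)              # best eps on the dyadic grid
```

For these values the dyadic minimum is attained at $\epsilon = 2^{-8}$, and the choice $\epsilon = 1/N$ loses only a small additive amount, in line with the claim that it is optimal to within a constant factor.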

Corollary 2 Let $\mathcal{F}$ be a Banach space embedded in C(X) and

$$L := \limsup_{\epsilon\to 0} \frac{\mathcal{H}_\epsilon(U_{\mathcal{F}})}{\log\frac{1}{\epsilon}} \in (0, \infty). \tag{19}$$

There exists a strategy for Predictor that guarantees, for all $F \in \mathcal{F}$,

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + CL\log N \tag{20}$$

from some N on, where C is a universal constant.

Remember that any Banach space $\mathcal{F}$ satisfying (19) is automatically
finite-dimensional ([27], Theorem XII).

Proof of Corollary 2 Since the Banach space $\mathcal{F}$ is finite-dimensional, it is
compactly embedded in C(X). Substituting $\epsilon := 1/N$ in (8), we obtain

$$\begin{aligned}
\sum_{n=1}^{N} (y_n - \mu_n)^2 &\le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C\left(L\log(\phi N) + \log\log N + \log\log\phi + 2\right) \\
&\le \sum_{n=1}^{N} (y_n - F(x_n))^2 + 2CL\log N
\end{aligned}$$

from some N on. ∎

Theorem 4 is irrelevant to this section: no finite-dimensional subspace can
be dense in C(X) (since finite-dimensional subspaces are always closed).

Comparison with known results 

It is instructive to compare the bound of Corollary 2 with a standard bound
in competitive linear regression, obtained in [42] for the prediction strategy
referred to as AAR in [42] and as the "Vovk-Azoury-Warmuth forecaster" in
[12]. In the metric entropy method the elements of a net in $\mathcal{F}$ (the union of
$\epsilon$-nets of different balls in $\mathcal{F}$ for different $\epsilon$, in the case of Theorem 3 and its
corollaries) are AA mixed. AAR is conceptually very similar: instead of AA
mixing the elements of the net, it AA mixes $\mathcal{F}$ itself; the weights assigned to
the elements of the net are replaced by a "prior" probability measure on $\mathcal{F}$,
and so summation is replaced by integration. An advantage of this "integration
method" is that, for a suitable choice of the prior measure, it may produce a
computationally efficient prediction strategy: e.g., AAR, which uses a Gaussian
measure as prior, turned out to be a simple modification of ridge regression, as
computationally efficient as ridge regression itself.

Suppose that X is a bounded subset of $\mathbb{R}^m$ and set

$$X_2 := \sup_{x\in X}\|x\|_2, \quad X_\infty := \sup_{x\in X}\|x\|_\infty; \tag{21}$$

it is clear that $X_\infty \le X_2$. AAR guarantees

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - \langle\theta, x_n\rangle)^2 + \|\theta\|_2^2 + m\ln\left(N X_\infty^2 + 1\right) \tag{22}$$

(see [42], (22) with a := 1 and $Y^2$ replaced by 1). To extract a similar inequality
from (20), let $U_m$ be the unit ball in $\mathbb{R}^m$ equipped with the $\|\cdot\|_2$ norm, $\mathcal{F}$ be
the set of linear functions $x \in X \mapsto \langle\theta, x\rangle$, $\theta \in \mathbb{R}^m$, with the norm $\|\theta\|_2$, and
notice that m 

H e (Ur) < X 2 H t (U m ) < X 2 log \(~) . (23) 
The first inequality in (23) follows from $X_2$ being the embedding constant of
$\mathcal{F}$ into C(X) (and also from the Cauchy-Schwarz inequality). The second
inequality in (23) follows from the inequality (1.1.10) in [10].
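
For comparison, here is a minimal Python sketch of AAR in the one-dimensional case m = 1. It is written from the standard description of the Vovk-Azoury-Warmuth forecaster as ridge regression in which the current signal is already included in the regularized second-moment term. The function names, the scalar setting and the data are assumptions for illustration, and the details should be checked against [42].

```python
import math
from typing import List, Sequence


def aar_predictions(signals: Sequence[float], observations: Sequence[float],
                    a: float = 1.0) -> List[float]:
    """One-dimensional AAR with regularization parameter a > 0."""
    second_moment = a     # a + sum of x_i**2, including the current signal
    cross = 0.0           # sum of y_i * x_i over the rounds already completed
    predictions = []
    for x, y in zip(signals, observations):
        second_moment += x * x          # current signal is counted before predicting
        predictions.append(cross * x / second_moment)
        cross += y * x                  # y_n becomes available only after predicting
    return predictions


def quadratic_loss(predictions, observations):
    return sum((y - p) ** 2 for p, y in zip(predictions, observations))


# Hypothetical data with |y_n| <= 1 and |x_n| <= 1, for illustration only.
N = 300
signals = [math.cos(0.7 * n) for n in range(1, N + 1)]
observations = [0.5 * x + 0.3 * math.sin(n)
                for n, x in enumerate(signals, start=1)]
aar_loss = quadratic_loss(aar_predictions(signals, observations), observations)
```

With a = 1 and signals bounded by 1 in absolute value, the AAR bound with m = 1 says that `aar_loss` exceeds the loss of any linear rule $x \mapsto \theta x$ by at most $\theta^2 + \ln(N + 1)$.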

Remark A popular alternative (used in [10] and, in a slightly modified form,
llSp to the notion of metric entropy $\mathcal{H}_\epsilon(A)$ is that of entropy numbers $\epsilon_n(A)$,
n = 1, 2, ..., defined as the infimum of $\epsilon$ such that there exists an $\epsilon$-net for
A consisting of n elements. Notice that the "infimum" here is attained (and so
can be replaced by "minimum") because of the compactness of $A^n$. It is easy to
see that

$$2^{\mathcal{H}_\epsilon(A)} = \min\{ n \mid \epsilon_n(A) \le \epsilon \}; \tag{24}$$

this can be useful when translating results about entropy numbers into results
about metric entropy.

Combining (23) with Corollary 2 we obtain the following analogue of (22).

Corollary 3 Let X be a bounded set in $\mathbb{R}^m$ and $X_2$ be defined by (21). There
exists a strategy for Predictor that guarantees, for all $\theta \in \mathbb{R}^m$,

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - \langle\theta, x_n\rangle)^2 + C X_2\, m \log N \tag{25}$$

from some N on, where C is a universal constant.

An interesting feature of the regret terms in (22) and (25) is their logarithmic
dependence on N; some other standard bounds, such as those in [22] and [JJ,
involve $\sqrt{N}$ (or similar terms, such as the square root of the competitor's loss).
It is remarkable that the bound established in the first paper on competitive
on-line regression, [20], also depends on N logarithmically; the method used
in that paper is penalized minimum least squares. An important advantage
of the bounds given in (111 1241 E] is that the character of their dependence on 
the dimension m allows one to carry them over to infinite-dimensional function 
spaces; these bounds will be discussed again in 



5 Benchmark classes of analytic functions 

In this section we consider classes of analytic functions, and so it is natural
to consider complex-valued functions of one or more complex variables. The
observations are now any complex numbers, $y_n \in \mathbb{C}$, bounded by 1 in absolute
value, and so prediction rules are functions $F : X \to \mathbb{C}$. Also, in this section
C(X) will stand for the function space of continuous complex-valued functions
on X. It is clear that Theorems 2-4 continue to hold in this extended framework.

According to [27], §3.II, the typical growth rate for the metric entropy of
infinite-dimensional classes $\mathcal{F}$ of analytic functions on X is

$$\mathcal{H}_\epsilon(\mathcal{F}) \asymp \log^{m+1}\frac{1}{\epsilon}, \tag{26}$$

where m is the dimension of X. (Although intermediate rates such as

$$\frac{\log^{m+1}\frac{1}{\epsilon}}{\log\log\frac{1}{\epsilon}}$$

also sometimes occur.) For such growth rates (the complex versions of)
Theorems 2-4 imply the following three corollaries.

Corollary 4 Suppose $\mathcal{F}$ is a compact set in C(X) and M > 0 is such that

$$L := \limsup_{\epsilon\to 0} \frac{\mathcal{H}_\epsilon(\mathcal{F})}{\log^M\frac{1}{\epsilon}} \in (0, \infty).$$

There exists a strategy for Predictor that guarantees, for all $F \in \mathcal{F}$,

$$\sum_{n=1}^{N} |y_n - \mu_n|^2 \le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + CL\log^M N \tag{27}$$

from some N on, where C is a universal constant.

Proof As in the proof of Corollary 1, set $\epsilon := 1/N$ in (6). ∎

Corollary 5 Let $\mathcal{F}$ be a Banach function space compactly embedded in C(X)
and M > 0 be a number such that

$$L := \limsup_{\epsilon\to 0} \frac{\mathcal{H}_\epsilon(U_{\mathcal{F}})}{\log^M\frac{1}{\epsilon}} \in (0, \infty). \tag{28}$$

There exists a strategy for Predictor that guarantees, for all $F \in \mathcal{F}$,

$$\sum_{n=1}^{N} |y_n - \mu_n|^2 \le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + CL\log^M N \tag{29}$$

from some N on, where C is a universal constant.

Proof Following the proof of Corollary 2, we substitute $\epsilon := 1/N$ in (8) to
obtain, from some N on:

$$\begin{aligned}
\sum_{n=1}^{N} |y_n - \mu_n|^2 &\le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + C\left(L\log^M(\phi N) + \log\log N + \log\log\phi + 2\right) \\
&\le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + C'L\log^M N,
\end{aligned}$$

where $C'$ is another universal constant.

Unlike in the previous section, Theorem 4 is not vacuous for classes of
analytic functions: as we will see in the following subsection, there are numerous
examples of such classes that are compactly and densely embedded in C(X),
for important signal spaces X. The following is the implication of Theorem 4
for the growth rate (26): unfortunately, this statement still has $\inf_\epsilon$ since the
growth rate of $\mathcal{A}^{\mathcal{F}}_\epsilon(F)$ is unknown.

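Corollary 6 Let $\mathcal{F}$ be a Banach function space compactly and densely embedded
in C(X) and let L and M be positive numbers satisfying (28). There exists a
strategy for Predictor that guarantees, for all $F \in C(X)$,

$$\sum_{n=1}^{N} |y_n - \mu_n|^2 \le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + C_M \inf_{\epsilon\in(0,1]} \left( L\left(\log^+ \mathcal{A}^{\mathcal{F}}_\epsilon(F)\right)^M + L\log^M\frac{1}{\epsilon} + \epsilon N \right) \tag{30}$$

from some N on, where $C_M$ is a constant depending only on M and $\log^+$ is
defined as

$$\log^+ t := \begin{cases} \log t & \text{if } t \ge 1 \\ 0 & \text{otherwise.} \end{cases}$$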

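Proof It is clear that the optimal value of $\epsilon$ in the regret term in (15) tends to
0 as $N \to \infty$, and so the regret term can be bounded above by

$$\begin{aligned}
& C \inf_{\epsilon\in(0,1/2]} \left( L\log^M\frac{A(\epsilon)}{\epsilon} + \log\log\frac{1}{\epsilon} + \log\log A(\epsilon) + \epsilon N + 1 \right) \\
&\quad \le C' \inf_{\epsilon\in(0,1/2]} \left( L\left(\log^+ \mathcal{A}^{\mathcal{F}}_\epsilon(F) + \log\frac{1}{\epsilon}\right)^M + \epsilon N \right)
\end{aligned}$$

from some N on. (The case $F \in \mathcal{F}$ has to be considered separately.)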



Examples 

We will reproduce two simple examples from [27]; for simplicity we only consider
analytic functions of one complex variable (although already [27] contains results
making extension to several variables straightforward). Remember that the set
of all complex numbers is denoted $\mathbb{C}$.

Let K be a simply connected continuum in $\mathbb{C}$ containing more than one point
and G be a region (connected open set) such that $K \subset G \subset \mathbb{C}$. The set of all
complex-valued functions on K that admit a bounded analytic continuation to
G is denoted $A^K_G$. Equipped with the usual pointwise addition and scalar action
and with the norm

$$\left\| f|_K \right\|_{A^K_G} := \sup_{z\in G} |f(z)|, \tag{31}$$

where $f : G \to \mathbb{C}$ ranges over the bounded analytic functions and $f|_K$ is the
restriction of f to K, it becomes a Banach space. Expression (31) is well-defined
by the uniqueness theorem in complex analysis, and the completeness of
$A^K_G$ follows from the fact (known as Weierstrass's theorem, [3], Theorem IV.1.1)
that uniform limits of analytic functions are analytic.

It is shown in [27], (139), that

$$\mathcal{H}_\epsilon\left(U_{A^K_G}\right) \sim \tau(G, K)\log^2\frac{1}{\epsilon} \tag{32}$$

(this was hypothesised by Kolmogorov and proved independently by Babenko
and Erokhin; in [51] the constant $\tau(G, K)$ was shown, under mild restrictions, to
be proportional to the Green capacity of K relative to G). Therefore, Corollary
5 gives a strategy for Predictor guaranteeing

$$\sum_{n=1}^{N} |y_n - \mu_n|^2 \le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + C\tau(G, K)\log^2 N \tag{33}$$

for all $F \in A^K_G$ and from some N on, where C is a universal constant.

In many interesting special cases considered in [27], §7, the constant $\tau(G, K)$
has a simple explicit expression, e.g.:

• $\tau(G, K) = 1/\log(R/r)$ if $K = r\mathbb{D}$ and $G = R\mathbb{D}$, R > r > 0, $\mathbb{D} := U_{\mathbb{C}}$ being
the open unit disk in $\mathbb{C}$;

• $\tau(G, K) = 1/(2\log\lambda)$ if $K = [-1, 1]$ and G is the ellipse with the sum
of semi-axes equal to $\lambda > 1$ and with foci at the points ±1 (there is a
misprint in [27], (131); the correct formula is given in, e.g., [41], Theorem
1 in §12).

Both these expressions were obtained by Vitushkin. 
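
As a small worked example of these expressions, the Python sketch below evaluates $\tau(G, K)$ in the two cases and the corresponding leading term $\tau(G, K)\log^2 N$ of the regret in (33), with the unspecified universal constant C left out. Logarithms are taken to be binary, as in the definition of metric entropy; the function names and numerical values are assumptions for illustration.

```python
import math


def tau_disks(R: float, r: float) -> float:
    """tau(G, K) for concentric disks of radii R > r > 0."""
    if not R > r > 0:
        raise ValueError("need R > r > 0")
    return 1.0 / math.log2(R / r)


def tau_ellipse(lam: float) -> float:
    """tau(G, K) for K = [-1, 1] and the ellipse with foci at -1 and 1
    whose semi-axes sum to lam > 1."""
    if not lam > 1:
        raise ValueError("need lam > 1")
    return 1.0 / (2.0 * math.log2(lam))


def regret_leading_term(tau: float, N: int) -> float:
    """tau * log^2 N, i.e. the regret term of (33) without the constant C."""
    return tau * math.log2(N) ** 2
```

For instance, `tau_disks(4.0, 1.0)` and `tau_ellipse(2.0)` both equal 1/2, so for N = 1024 the leading term is $\frac{1}{2}\cdot 10^2 = 50$: the wider the region G of analyticity relative to K, the smaller the regret term.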

Let h > 0. The vector space of all $2\pi$-periodic complex-valued functions
on the real line $\mathbb{R}$ that admit a bounded analytic continuation to the strip
$\{z \in \mathbb{C} \mid |\operatorname{Im} z| < h\}$ is denoted $A_h$. The norm in this space is defined by

$$\left\| f|_{\mathbb{R}} \right\|_{A_h} := \sup_{z :\, |\operatorname{Im} z| < h} |f(z)|, \tag{34}$$

where f ranges over the bounded analytic functions on $\{z \mid |\operatorname{Im} z| < h\}$.
Expression (34) is again well-defined and the normed space $A_h$ is complete. The
estimate of the metric entropy of the unit ball of $A_h$ given in [27], (130), is

$$\mathcal{H}_\epsilon\left(U_{A_h}\right) \sim \frac{2}{h\log e}\log^2\frac{1}{\epsilon} \tag{35}$$

(Vitushkin). Corollary 5 now gives

$$\sum_{n=1}^{N} |y_n - \mu_n|^2 \le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + \frac{C}{h}\log^2 N \tag{36}$$

for all $F \in A_h$ and from some N on, where C is a universal constant.

Corollary 6 is applicable to $\mathcal{F} = A_h$ for any h > 0, and so $A_h$ gives rise to
a universal prediction strategy. Indeed, taking $X := \partial\mathbb{D}$ (the unit circle in $\mathbb{C}$)
and identifying complex-valued functions on $\partial\mathbb{D}$ with the corresponding
$2\pi$-periodic complex-valued functions on $\mathbb{R}$ (namely, $f : \partial\mathbb{D} \to \mathbb{C}$ is identified with
the function $t \in \mathbb{R} \mapsto f(e^{it})$), we can arbitrarily closely in C(X) approximate
each $F \in C(X)$ by a trigonometric polynomial (this is Weierstrass's second
theorem, £Q, §21), whose analytic continuation to $\{z \mid |\operatorname{Im} z| < h\}$ is bounded;
we can see that $A_h$ is dense in C(X).

Suppose $X \subset \mathbb{C}$ is compact (in particular, closed). For $A^X_G$ to be dense in
C(X), X must be nowhere dense in $\mathbb{C}$ (since limits in C(X) of elements of $A^X_G$
would be analytic in the interior points of X). If we additionally assume that X
is simply connected, Mergelyan's theorem ([S2], Theorem 20.5) will guarantee
that every continuous complex-valued function on X can be arbitrarily closely
in C(X) approximated by a polynomial. We can see that $A^X_G$ is dense in C(X)
provided X is a nowhere dense simply connected compact. The most interesting
case is perhaps where X = [a, b] is a closed interval in $\mathbb{R}$.



Dense function spaces popular in learning theory 

Benchmark classes such as $A_G$ and $A_h$ have never been used, to my knowledge, in competitive on-line prediction. Familiar rates of growth of the regret term are $O(\log N)$ or $N^\alpha$ (for $\alpha \in (0, 1)$, usually $\alpha = 1/2$); intermediate rates obtainable for $A_G$ and $A_h$, such as (33) and (36), have not been known.

Several benchmark classes of this type, however, have been implicitly considered since they are reproducing kernel Hilbert spaces corresponding to popular reproducing kernels (see [40] and [35] for the use of reproducing kernels in learning theory and for the theory of reproducing kernel Hilbert spaces, or RKHS for brevity). One such space is the Hardy space $H^2(\mathbb{D})$ restricted to the interval $(-1, 1)$ of the real line (see, e.g., [30]). Mergelyan's theorem (or Weierstrass's first theorem, [1], §20) immediately implies that for each $\epsilon > 0$ the restriction of $H^2(\mathbb{D})$ to $[-1 + \epsilon, 1 - \epsilon]$ is dense in $C([-1 + \epsilon, 1 - \epsilon])$: indeed, each polynomial belongs to $H^2(\mathbb{D})$. (In the multi-dimensional case, this fact was established by Steinwart Example 2.) It is easy to see that, when $X = [-1 + \epsilon, 1 - \epsilon]$, (33) holds not only for $A_G := A_{\mathbb{D}}$ but also for $A_G$ replaced by the restriction of $H^2(\mathbb{D})$ to $X$ and for $r(G, K)$ replaced by a suitable constant depending only on $\epsilon$.

Remark The reproducing kernel 



of the Hardy space $H^2(\mathbb{D})$ is known as the Szegő kernel. In some recent learning literature (such as [SB], Example 2, Example 4.24) the restriction of the multidimensional Szegő kernel to the unit ball in a Euclidean space is referred to as "Vovk's infinite-degree polynomial kernel". The origin of this undeserved name is the SVM manual [3j]; I liked to use the Szegő kernel when explaining the idea of reproducing kernels to my students.
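
The name "infinite-degree polynomial kernel" can be illustrated with a small Python sketch. It assumes the normalization $K(z, w) = 1/(1 - z\bar{w})$ for the Szegő kernel (other normalizations differ by a constant factor) and restricts it to real arguments in $(-1, 1)$; the function names are hypothetical and are introduced only for this illustration. The sketch compares the closed form with partial sums of the geometric series $1 + xy + (xy)^2 + \cdots$.

```python
def szego_kernel(x: float, y: float) -> float:
    """Assumed normalization K(x, y) = 1 / (1 - x*y) for real x, y in (-1, 1)."""
    if not (-1.0 < x < 1.0 and -1.0 < y < 1.0):
        raise ValueError("both arguments must lie in the open interval (-1, 1)")
    return 1.0 / (1.0 - x * y)


def truncated_polynomial_kernel(x: float, y: float, degree: int) -> float:
    """Partial sum 1 + xy + (xy)^2 + ... + (xy)^degree of the geometric series."""
    return sum((x * y) ** k for k in range(degree + 1))


if __name__ == "__main__":
    x, y = 0.5, -0.3
    for degree in (1, 5, 20):
        # The gap shrinks geometrically as the polynomial degree grows.
        gap = abs(szego_kernel(x, y) - truncated_polynomial_kernel(x, y, degree))
        print(degree, gap)
```

Each partial sum is a polynomial kernel of finite degree, so under the stated assumption the closed form can be read as their limit.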

Other popular spaces of analytic functions on $\mathbb{R}^m$ are the reproducing kernel Hilbert spaces corresponding to the "Gaussian RBF kernels", parameterized by
a > 0. They are described in [37] (and also earlier in .7 and, more explicitly 
We have for them both the 0(log 2 N) rate of growth of the regret term 
and the denseness in $C(K)$ for each compact $K \subseteq \mathbb{R}^m$ (see Example 1). Interestingly, these RKHS do not look dense in $C(K)$: it appears that they can only approximate functions at the scale comparable with the parameter a (perhaps the cause of this illusion is the small metric entropy of these function classes).

In general, it appears that most common reproducing kernels give rise to RKHS consisting of analytic functions. Suppose that $X$ is a bounded set in a Euclidean space $\mathbb{R}^m$. It is often the case that the reproducing kernel $K(z, w)$, $z, w \in X$, admits a continuation to a neighbourhood $O^2 \subseteq (\mathbb{C}^m)^2$ of $X^2$ analytic in its first argument $z$ and remaining a reproducing kernel. By the Tietze-Uryson theorem ([12], 2.1.8) there is an intermediate neighbourhood $G$, such that $X \subseteq G \subseteq \overline{G} \subseteq O$. Let $\mathcal{F}$ be the RKHS on $G$ generated by the given reproducing kernel $K$ thus extended to $G^2$. It is clear that

$$c_{\mathcal{F}} := \sup_{z \in G} \sqrt{K(z, z)} \tag{37}$$

is finite. The set of the evaluation functionals $K_w(z) := K(z, w)$, $w \in G$, is dense in $\mathcal{F}$ ([5], §2(4); for details, see [3], Theorem 2), convergence in $\mathcal{F}$ implies convergence in $C(G)$ (by $c_{\mathcal{F}} < \infty$ and [5], §2(5)), each $K_w$ is analytic, and uniform limits of analytic functions are analytic ([3], Theorem IV.1.1); therefore, $\mathcal{F}$ consists of analytic functions. Since $c_{\mathcal{F}} < \infty$, we have $U_{\mathcal{F}} \subseteq c_{\mathcal{F}} U_{A_G}$, and so $\mathcal{F}$ is compactly embedded in $C(X)$ and, as above, the regret term grows as a polynomial of $\log N$.

Steinwart [21] gives four examples of reproducing kernels on $X^2$ that can be analytically continued to a neighbourhood of $X^2$, as in the previous paragraph, and whose RKHS are dense in $C(X)$ (we described the first two of his examples above).

Sometimes formulas for reproducing kernels contain "awkward" building blocks such as taking the fractional part ([30], (10.2.4)), absolute value, or min ([43], (8)), and in this case analytic continuation to a neighbourhood is usually impossible. Such reproducing kernels are often derived from the corresponding RKHS that are much more massive than the classes of analytic functions considered in this section; such massive classes will be considered in the next section.

6 Sobolev-type classes 

We now return to our basic prediction protocol in which the observations $y_n$ are real numbers (bounded by 1 in absolute value); $C(X)$ will again denote the continuous real-valued functions on $X$.

Typical classes studied in the theory of functions of a real variable are much more massive than typical classes of analytic functions. In the second part of this section we will see examples showing that the typical growth rate for the metric entropy of compact classes $\mathcal{F}$ of real-valued functions defined on nice subsets of a Euclidean space is

$$\mathcal{H}_\epsilon(\mathcal{F}) \sim (1/\epsilon)^\gamma, \tag{38}$$

where $\gamma > 0$ is the "degree of non-smoothness" of $\mathcal{F}$. The following two corollaries are asymptotic versions of Theorems for this growth rate.

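Corollary 7 Suppose a compact set $\mathcal{F}$ in $C(X)$ and a positive number $\gamma$ satisfy

$$L := \limsup_{\epsilon \to 0} \frac{\mathcal{H}_\epsilon(\mathcal{F})}{(1/\epsilon)^\gamma} \in (0, \infty).$$

There exists a strategy for Predictor that guarantees, for all $F \in \mathcal{F}$,

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C L^{\frac{1}{\gamma+1}} N^{\frac{\gamma}{\gamma+1}} \tag{39}$$

from some $N$ on, where $C$ is a universal constant.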
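Proof Solving

$$L(1/\epsilon)^\gamma + \epsilon N \to \min, \tag{40}$$

we obtain

$$\epsilon = \left(\frac{L\gamma}{N}\right)^{\frac{1}{\gamma+1}} \to 0 \quad (N \to \infty) \tag{41}$$

and

$$L(1/\epsilon)^\gamma + \epsilon N = \left(\gamma^{\frac{1}{\gamma+1}} + \gamma^{-\frac{\gamma}{\gamma+1}}\right) L^{\frac{1}{\gamma+1}} N^{\frac{\gamma}{\gamma+1}}; \tag{42}$$

since the first factor on the right-hand side of the last expression always belongs to $(1, 2]$, it can be ignored. ∎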

Remark We will have to find minima such as (40) on several occasions, and for future reference I will give the general result of the calculation for

$$A\epsilon^{-a} + B\epsilon^{b} \to \min, \tag{43}$$

where $A, B, a, b$ are positive numbers and $\epsilon$ ranges over $(0, \infty)$. The minimum is attained at

$$\epsilon = \left(\frac{aA}{bB}\right)^{\frac{1}{a+b}} \tag{44}$$

and is equal to

$$\left((a/b)^{\frac{b}{a+b}} + (b/a)^{\frac{a}{a+b}}\right) A^{\frac{b}{a+b}} B^{\frac{a}{a+b}}. \tag{45}$$

Instead of finding the precise minimum in (43), it will usually be more convenient to approximate it by equating the two addends in (43), which gives

$$\epsilon = (A/B)^{\frac{1}{a+b}} \tag{46}$$

and so gives the upper bound

$$2 A^{\frac{b}{a+b}} B^{\frac{a}{a+b}} \tag{47}$$

for (45).
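
The calculation in this remark can be checked numerically. The following Python sketch (the function names and the sample numbers are hypothetical) computes the exact minimizer of $A\epsilon^{-a} + B\epsilon^{b}$, obtained by setting the derivative to zero, and the simpler bound obtained by equating the two addends; since the ratio of the two values is $2$ divided by a factor in $(1, 2]$, the simpler bound loses at most a factor of 2.

```python
def exact_minimum(A: float, B: float, a: float, b: float) -> tuple[float, float]:
    """Minimizer and minimum of A*eps**(-a) + B*eps**b over eps > 0 (all inputs positive)."""
    eps = (a * A / (b * B)) ** (1.0 / (a + b))
    return eps, A * eps ** (-a) + B * eps ** b


def balanced_bound(A: float, B: float, a: float, b: float) -> tuple[float, float]:
    """Point where the two addends are equal, and the value 2*A^(b/(a+b))*B^(a/(a+b)) there."""
    eps = (A / B) ** (1.0 / (a + b))
    return eps, 2.0 * A ** (b / (a + b)) * B ** (a / (a + b))


if __name__ == "__main__":
    # Hypothetical numbers in the setting of (40): A = L, a = gamma, B = N, b = 1.
    L, gamma, N = 3.0, 0.5, 10_000.0
    print("exact   :", exact_minimum(L, N, gamma, 1.0))
    print("balanced:", balanced_bound(L, N, gamma, 1.0))
```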

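Corollary 8 Let $\mathcal{F}$ be a Banach function space compactly embedded in $C(X)$ and $\gamma$ be a positive number such that

$$L := \limsup_{\epsilon \to 0} \frac{\mathcal{H}_\epsilon(U_{\mathcal{F}})}{(1/\epsilon)^\gamma} \in (0, \infty). \tag{48}$$

There exists a strategy for Predictor that guarantees, for all $F \in \mathcal{F}$,

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C L^{\frac{1}{\gamma+1}} \phi^{\frac{\gamma}{\gamma+1}} N^{\frac{\gamma}{\gamma+1}} \tag{49}$$

from some $N$ on, where $C$ is a universal constant and $\phi$ is defined as before.

Proof Substituting $L\phi^\gamma$ for $L$ on the right-hand side of (42) and ignoring the first factor on the right-hand side, we obtain (49). ∎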



Examples 

We will say that a function $F$ defined on a metric space with metric $\rho$ is Hölder continuous of order $\alpha \in (0, 1]$ with coefficient $c > 0$ if, for all $x$ and $x'$ in the domain of $F$, $|F(x) - F(x')| \le c\rho^\alpha(x, x')$. If $\alpha = 1$, we will also say that $F$ is Lipschitzian with coefficient $c$.

Let $X$ be an $m$-dimensional (axes-parallel) parallelepiped. Define $\mathcal{F}$ to be the class of real-valued functions on $X$ that are bounded in $C(X)$ by a given constant and whose $k$th partial derivatives exist and are all Hölder continuous of order $\alpha$ with a given coefficient. It is shown in [27], Theorem XIII, that

$$\mathcal{H}_\epsilon(\mathcal{F}) \asymp (1/\epsilon)^\gamma, \tag{38}$$

where $\gamma := m/s = m/(k + \alpha)$ is the "degree of non-smoothness" ($1/\gamma$ was called the "degree of smoothness" by G. G. Lorentz in his review of [23] in Mathematical Reviews) and $s := k + \alpha$ is the "indicator of smoothness" ([27], §3.III). We can now deduce from (39) that

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C_{\mathcal{F}} N^{\frac{\gamma}{\gamma+1}} \tag{50}$$

for all $F \in \mathcal{F}$ and from some $N$ on, where $C_{\mathcal{F}}$ is a constant depending on $\mathcal{F}$ but nothing else.
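
To make the dependence of the regret exponent on the smoothness concrete, here is a minimal Python sketch (hypothetical helper names, exact rational arithmetic). It computes the degree of non-smoothness $\gamma = m/(k + \alpha)$ and the resulting exponent $\gamma/(\gamma + 1) = m/(m + s)$ of $N$ in the regret term.

```python
from fractions import Fraction


def non_smoothness(m: int, k: int, alpha: Fraction) -> Fraction:
    """Degree of non-smoothness gamma = m / (k + alpha), with alpha in (0, 1]."""
    if not (0 < alpha <= 1):
        raise ValueError("alpha must lie in (0, 1]")
    return Fraction(m) / (Fraction(k) + alpha)


def regret_exponent(gamma: Fraction) -> Fraction:
    """Exponent gamma / (gamma + 1) of N in the regret term."""
    return gamma / (gamma + 1)


if __name__ == "__main__":
    # Lipschitz functions on an interval: m = 1, k = 0, alpha = 1.
    print(regret_exponent(non_smoothness(1, 0, Fraction(1))))
    # Functions on a rectangle with Hoelder-1/2 first derivatives: m = 2, s = 3/2.
    print(regret_exponent(non_smoothness(2, 1, Fraction(1, 2))))
```

Smoother classes (larger $s$) give smaller exponents, while higher-dimensional signal spaces (larger $m$) give larger ones.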

For the class $\mathcal{F}$ of Lipschitzian functions with coefficient $c$ defined on an interval of the real line of length $l$ and bounded in absolute value by a given constant, Kolmogorov and Tikhomirov [27] (see their (10), which also remains
true when H € (A) is replaced by Ttf(A)) obtain the more accurate estimate 
$\mathcal{H}_\epsilon(\mathcal{F}) \sim cl/\epsilon$. In this case (50) can be replaced by

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C\sqrt{cl}\, N^{1/2}, \tag{51}$$

where $C$ is a universal constant.

Results of this type have been greatly extended in recent years. We will later state one such result about Besov spaces $B^s_{p,q}(X)$. For the general definition of Besov spaces see [18], §§2.2, 2.5. Besov spaces $B^s_{p,q}$ are Banach spaces (assuming $p, q \ge 1$), but we will consider them as topological vector spaces (i.e., will regard Banach spaces with equivalent norms as the same space).

Remark A popular definition of Besov spaces is via "real interpolation" (as in [2], Chapter 7). For example, according to this definition, $B^s_{p,\infty}(X)$, where $s \in (0, \infty)$ and $p \in [1, \infty)$, consists of the functions $F$ whose Gagliardo set (13) with $C(X)$ replaced by $L^p(X)$ and $\mathcal{F}$ replaced by the Sobolev space $W^{m,p}(X)$ (see [2], Chapter 3) for some integer $m > s$ contains the curve

$$\{(t_0, t_1) \in \mathbb{R}^2 \mid t_0^{1-\theta} t_1^{\theta} = c\}, \quad \theta := s/m,$$

for some positive $c$; the infimum of $c$ with this property is the norm of $F$ in $B^s_{p,\infty}(X)$.

We are only interested in Besov spaces whose domain is the signal space 
X. In the rest of this section it will always be assumed that X is a subset of 
Euclidean space, $X \subseteq \mathbb{R}^m$, which is a minimally regular domain, in the sense that it is bounded and coincides with the interior of its closure ([18], Definition 2.5.1/2).

Every $B^s_{p,q}(X)$ with $s > m/p$ is compactly embedded in $C(X)$ (apply [18], (2.5.1/10), to $s_1 := s$, $p_1 := p$, $q_1 := q$, $p_2 := q_2 := \infty$ and sufficiently small $s_2 > 0$ and remember that $\mathcal{C}^s(X) := B^s_{\infty,\infty}(X)$ are Hölder-Zygmund spaces, [18], 2.2.2(iv)). We will be interested only in this case. Edmunds and Triebel's general result (Theorem 3.5 of [18] applied to $s_1 := s$, $p_1 := p$, $q_1 := q$, $s_2 := 0$, $p_2 := \infty$ and $q_2 := 1$, in combination with (2.3.3/3)) then shows that

$$\mathcal{H}_\epsilon(U_{B^s_{p,q}(X)}) \asymp (1/\epsilon)^{m/s}$$

(use ()2 If to move between entropy numbers and metric entropy). We can see 
that (50) still holds for $\mathcal{F}$ a bounded set in a general Besov space $B^s_{p,q}(X)$; moreover, Corollary 8 shows that

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C_{X,s,p,q} \left(\|F\|_{B^s_{p,q}(X)} + 1\right)^{\frac{m}{m+s}} N^{\frac{m}{m+s}} \tag{52}$$

for all $F \in B^s_{p,q}(X)$ from some $N$ on, where $C_{X,s,p,q}$ is a constant depending only on $X, s, p, q$. Setting $p$ and $q$ to $\infty$, we recover (50).

To conclude this subsection, let us go back to reproducing kernels. Cucker and Smale (HS|, Theorem D) show that if $\mathcal{F}$ is an RKHS with a $C^\infty$ reproducing kernel on $X^2$ for a compact set $X \subseteq \mathbb{R}^m$,

$$\mathcal{H}_\epsilon(U_{\mathcal{F}}) = O\left((1/\epsilon)^{2m/h}\right)$$

for an arbitrary $h > m$. Corollary 8 (together with its proof, since the $L$ in (48) is 0 for each $\gamma$ and so has to be replaced by an upper bound) shows that, for an arbitrarily small $\delta > 0$,

N N 
n—1 n—1 

for all $F \in \mathcal{F}$ from some $N$ on. The regret term in (53) is not as good as the poly-log regret term for RKHS with analytic reproducing kernels (see §5), but this is not surprising: the class of analytic functions is known to be much narrower than that of infinitely differentiable functions (for a useful relation between these classes see 3.7.1).
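
The way an arbitrarily small exponent arises can be sketched as follows. The Python snippet below is only an illustration under the assumption that the regret exponent is obtained by plugging $\gamma = 2m/h$ into the exponent $\gamma/(\gamma + 1)$ of Corollary 8; the helper names are hypothetical.

```python
def smooth_kernel_exponent(m: int, h: float) -> float:
    """Exponent gamma/(gamma+1) with gamma = 2m/h; requires h > m."""
    if h <= m:
        raise ValueError("h must exceed m")
    gamma = 2.0 * m / h
    return gamma / (gamma + 1.0)


def smoothness_for_target(m: int, delta: float) -> float:
    """Value of h for which 2m/(2m+h) equals delta (delta in (0, 2/3) ensures h > m)."""
    if not (0.0 < delta < 2.0 / 3.0):
        raise ValueError("delta must lie in (0, 2/3)")
    return 2.0 * m * (1.0 - delta) / delta


if __name__ == "__main__":
    m = 2
    for delta in (0.5, 0.1, 0.01):
        h = smoothness_for_target(m, delta)
        print(delta, h, smooth_kernel_exponent(m, h))
```

Since $h$ can be taken arbitrarily large, the exponent $2m/(2m + h)$ can be pushed below any $\delta > 0$, at the price of a constant that this sketch does not track.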

Comparisons with defensive forecasting 

Many of the Besov spaces $B^s_{p,q}(X)$ are "uniformly convex", and this makes it possible to apply to them a result obtained in [45] using the method of "defensive forecasting".

Let $V$ be a Banach space and $\partial U_V := \{v \in V \mid \|v\|_V = 1\}$ be the unit sphere in $V$ (the boundary of the unit ball $U_V$). A convenient measure of rotundity of the unit ball $U_V$ is Clarkson's ^21 modulus of convexity

$$\delta_V(\epsilon) := \inf_{\substack{u, v \in \partial U_V \\ \|u - v\|_V = \epsilon}} \left(1 - \left\|\frac{u + v}{2}\right\|_V\right), \quad \epsilon \in (0, 2] \tag{54}$$

(we will be mostly interested in the small values of $\epsilon$).

If a Banach space $\mathcal{F}$ is continuously embedded in $C(X)$, the embedding constant will be denoted $c_{\mathcal{F}}$:

$$c_{\mathcal{F}} := \sup_{F:\, \|F\|_{\mathcal{F}} \le 1} \|F\|_{C(X)} < \infty \tag{55}$$

(we have already used this notation in the special case of RKHS: cf. (37)).
Proposition 1 ([45], Theorem 1) Let $\mathcal{F}$ be a Banach space continuously embedded in $C(X)$ and such that

$$\forall \epsilon \in (0, 2]: \quad \delta_{\mathcal{F}}(\epsilon) \ge (\epsilon/2)^p / p \tag{56}$$

for some $p \in [2, \infty)$. There exists a strategy for Predictor producing $\mu_n$ that are guaranteed to satisfy

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + 40\sqrt{c_{\mathcal{F}}^2 + 1}\, \left(\|F\|_{\mathcal{F}} + 1\right) N^{1 - 1/p} \tag{57}$$

for all $N = 1, 2, \ldots$ and all $F \in \mathcal{F}$.

It is interesting that in Proposition 1 $\mathcal{F}$ is not required to be compactly embedded in $C(X)$.

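It was shown by Clarkson (H^J, §3) that, for $p \in [2, \infty)$,

$$\delta_{L^p}(\epsilon) \ge 1 - \left(1 - (\epsilon/2)^p\right)^{1/p}.$$

(And this bound was shown to be optimal in [22].) This result was extended to some other Besov spaces in [15], Theorem 3: the modulus of convexity of each Besov space $B^s_{p,q}(\mathbb{R}^m)$, $s \in \mathbb{R}$, $p \in [2, \infty)$ and $q \in [p/(p-1), p]$, also satisfies

$$\delta_{B^s_{p,q}(\mathbb{R}^m)}(\epsilon) \ge 1 - \left(1 - (\epsilon/2)^p\right)^{1/p}. \tag{58}$$

Edmunds and Triebel [18], 2.5.1, define the Besov space $B^s_{p,q}(X)$ on $X \subseteq \mathbb{R}^m$ as the set of all restrictions of the functions in $B^s_{p,q}(\mathbb{R}^m)$ to $X$ with the norm

$$\|F\|_{B^s_{p,q}(X)} := \inf \|F^*\|_{B^s_{p,q}(\mathbb{R}^m)}, \tag{59}$$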

where $F^*$ ranges over all extensions of $F$ to $\mathbb{R}^m$. To check that $B^s_{p,q}(X)$ is at least as convex as $B^s_{p,q}(\mathbb{R}^m)$ for $p \ge 2$ and $p/(p-1) \le q \le p$, take any $F_1, F_2 \in B^s_{p,q}(X)$ of norm 1 and at a distance of $\epsilon$ from each other. If the infima in the definition (59) of the norms of $F_1$ and $F_2$ are attained, we can take the extensions $F_1^*$ and $F_2^*$ to $\mathbb{R}^m$ of norm 1 and notice that, as $\|F_1^* - F_2^*\|_{B^s_{p,q}(\mathbb{R}^m)} \ge \epsilon$ and the modulus of convexity is a non-decreasing function of $\epsilon$ ([28], Lemma 1.e.8),

$$\left\|\frac{F_1 + F_2}{2}\right\|_{B^s_{p,q}(X)} \le \left\|\frac{F_1^* + F_2^*}{2}\right\|_{B^s_{p,q}(\mathbb{R}^m)} \le 1 - \delta_{B^s_{p,q}(\mathbb{R}^m)}(\epsilon).$$

If the infima are not attained, we can still use a similar argument for $p \ge 2$ and $p/(p-1) \le q \le p$ with $\delta_{B^s_{p,q}(\mathbb{R}^m)}$ replaced by its lower bound given by (58). This shows that (58) extends to arbitrary domains:

$$\delta_{B^s_{p,q}(X)}(\epsilon) \ge 1 - \left(1 - (\epsilon/2)^p\right)^{1/p} \ge (\epsilon/2)^p / p. \tag{60}$$
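
The second inequality in (60) is an instance of Bernoulli's inequality, $(1 - t)^{1/p} \le 1 - t/p$ for $t \in [0, 1]$ and $p \ge 1$. A minimal numerical check in Python (hypothetical function names; the grid of test values is arbitrary) is sketched below.

```python
import math


def clarkson_bound(eps: float, p: float) -> float:
    """Lower bound 1 - (1 - (eps/2)**p)**(1/p) for eps in (0, 2]."""
    t = (eps / 2.0) ** p
    if t >= 1.0:
        return 1.0
    # -expm1(log1p(-t)/p) equals 1 - (1 - t)**(1/p) but is accurate for small t.
    return -math.expm1(math.log1p(-t) / p)


def power_bound(eps: float, p: float) -> float:
    """Simpler lower bound (eps/2)**p / p used in condition (56)."""
    return (eps / 2.0) ** p / p


if __name__ == "__main__":
    for p in (2.0, 3.0, 5.0):
        for eps in (0.1, 0.5, 1.0, 1.5, 2.0):
            print(p, eps, clarkson_bound(eps, p), power_bound(eps, p))
```

The two bounds nearly coincide for small $\epsilon$, which is the regime of interest, and separate only as $\epsilon$ approaches 2.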
Let $p \in [2, \infty)$, $q \in [p/(p-1), p]$ and $s \in (m/p, \infty)$. By Proposition 1 and (60), there exist a constant $C_{X,s,p,q} > 0$ and a strategy for Predictor producing $\mu_n$ that are guaranteed to satisfy

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C_{X,s,p,q} \left(\|F\|_{B^s_{p,q}(X)} + 1\right) N^{1 - 1/p} \tag{61}$$

for all $N = 1, 2, \ldots$ and all $F \in B^s_{p,q}(X)$. We can see that defensive forecasting works better than metric entropy at the "wild" end of the scale $B^s_{p,q}(X)$ whereas metric entropy better copes with smooth functions (at this time we only pay attention to the exponent of $N$, which is more important, from the asymptotic point of view as $N \to \infty$, than the coefficient in front of the power of $N$):

• Suppose $s \in (m/p, m/2]$. The exponent $1 - 1/p$ of $N$ in (61) can be taken arbitrarily close to $1 - s/m$, and we can see that it is then better than the exponent of $N$ in (52):

$$1 - \frac{s}{m} < \frac{m}{m + s}.$$

For example, in the very important case $m = 1$, $s \approx 1/2$ (typical trajectories of the Brownian motion are of this type) defensive forecasting gives approximately $N^{1/2}$ whereas the method of metric entropy gives approximately $N^{2/3}$.

• Suppose $s \in (m/2, m)$. The exponent of $N$ in (61) can always be taken as $1/2$, and it is still better than the exponent of $N$ in (52):

$$\frac{1}{2} < \frac{m}{m + s}.$$

• Suppose $s \in [m, \infty)$. A weakness of the method of defensive forecasting
(in its current state: see, e.g., @J an d [13 m addition to (JHU) is that 
it cannot give regret terms better than $O(N^{1/2})$. Therefore, the method of metric entropy beats defensive forecasting for smooth Besov spaces $B^s_{p,q}(X)$, $s > m$.
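
The three cases above can be summarized in a short Python sketch (hypothetical function names). It assumes, as in the discussion, that the exponent for defensive forecasting is the infimum of $1 - 1/p$ over admissible $p \ge 2$ with $s > m/p$, namely $\max(1 - s/m, 1/2)$, and that the exponent for metric entropy is $m/(m + s)$; constants in front of the powers of $N$ are ignored.

```python
def metric_entropy_exponent(m: float, s: float) -> float:
    """Exponent m/(m+s) of N in the metric-entropy regret term."""
    return m / (m + s)


def defensive_forecasting_exponent(m: float, s: float) -> float:
    """Infimum of 1 - 1/p over p >= 2 with s > m/p (approached, not attained, when s <= m/2)."""
    return max(1.0 - s / m, 0.5)


def better_method(m: float, s: float) -> str:
    """Name of the method with the smaller exponent of N (ties reported as 'tie')."""
    me = metric_entropy_exponent(m, s)
    df = defensive_forecasting_exponent(m, s)
    if abs(me - df) < 1e-12:
        return "tie"
    return "metric entropy" if me < df else "defensive forecasting"


if __name__ == "__main__":
    for s in (0.25, 0.5, 0.75, 1.0, 2.0):
        print(s, defensive_forecasting_exponent(1.0, s),
              metric_entropy_exponent(1.0, s), better_method(1.0, s))
```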

For comparison with (50), define the norm

$$\|F\|_s := \max\left( \sup_{x \in X} |F(x)|,\ \max_{|\beta| = k}\ \sup_{x, x' \in X:\, x \ne x'} \frac{\left|D^\beta F(x) - D^\beta F(x')\right|}{\|x - x'\|^\alpha} \right), \tag{62}$$

where $X$ is a parallelepiped in $\mathbb{R}^m$, $\beta = (\beta_1, \ldots, \beta_m)$ ranges over the multi-indices, $\|\cdot\|$ is any standard norm in $\mathbb{R}^m$, $F : X \to \mathbb{R}$ is a $k$ times continuously differentiable function, $\alpha \in (0, 1]$, and $s := k + \alpha$. It is easy to check that the Banach space normed by (62) is continuously embedded in $B^{s'}_{p,2}(X)$ for any $s' < s$: indeed, it is obvious that the space normed by (62) is continuously embedded in the Hölder-Zygmund space $\mathcal{C}^s(X) := B^s_{\infty,\infty}(X)$ ([18], 2.2.2(iv), 1.2.2), and the usual embedding theorem implies that the Hölder-Zygmund space is continuously embedded in $B^{s'}_{p,2}(X)$ ([18], (2.5.1/10), with $p_1 = q_1 = \infty$). Fixing an arbitrarily small $\delta > 0$, we deduce from (61) that for each $s \le m/2$ there exists a constant $C_{X,s,\delta} > 0$ such that

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C_{X,s,\delta} \left(\|F\|_s + 1\right) N^{1 - s/m + \delta} \tag{63}$$

for all $N = 1, 2, \ldots$ and all $F$ with finite $\|F\|_s$. For $s$ above $m/2$ we have to take $p = 2$ and so obtain

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C_{X,s} \left(\|F\|_s + 1\right) N^{1/2} \tag{64}$$

in place of (63); (64) starts losing to (50) when $s$ exceeds $m$.

It is also interesting to compare (64) for $m = s = 1$ with (51). Even though $\|F\|_{C(X)} \le 1$ is the only interesting case, the bound in (51) still appears better: it scales as $\sqrt{c}$ in $c$, whereas (64) scales as $c$ when applied to $\{F \mid \|F\|_s \le c\}$. This impression is confirmed by a more careful analysis: (49) implies

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C\sqrt{l\left(\|F\|_s + 1\right)}\, N^{1/2}$$

for all $F$ with finite $\|F\|_s$ from some $N$ on, where $C$ is a universal constant. Comparing this with (64), we can see another disadvantage of defensive forecasting: the regret term scales as the norm of $F$ (rather than its square root).

Other methods 

In this subsection I will briefly list some other methods that have been used in 
competitive on-line regression. It appears that the benchmark classes used have 
always belonged to types I or III in the Kolmogorov-Tikhomirov classification. 
This does not mean, however, that the available prediction algorithms can be 
clearly divided into two groups corresponding to types I and III: quite often an 
ostensibly type I algorithm can be easily extended to benchmark classes that 
are infinite-dimensional Hilbert spaces (of type III) using the so-called "kernel 
trick" (HOI) 123 )■ Sometimes the possibility of such an extension is only stated 
(more or less precisely) without the actual extension being carried out. In this 
subsection I will also discuss results of this type (which might involve some 
conditions of regularity that have not been stated explicitly) . 

Perhaps the most popular method for type III benchmark classes is Gradi- 
ent Descent, together with its version, Exponentiated Gradient (the pioneering 
paper is JT]; see also [231 and [§])• It is very efficient computationally and 
often gives the right orders of magnitude for the regret term. As an example, Auer et al. ([6], Theorem 3.1) obtain, for their prediction algorithm using Gradient Descent, the performance guarantee

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + 8c_{\mathcal{F}}^2 c^2 + 8c_{\mathcal{F}} c \sqrt{\frac{1}{2} \sum_{n=1}^{N} (y_n - F(x_n))^2 + c_{\mathcal{F}}^2 c^2} \tag{65}$$

for all $N$ and all $F \in cU_{\mathcal{F}}$, where $\mathcal{F}$ is a Hilbert space continuously embedded in $C(X)$ and $c$ is a known upper bound on $\|F\|_{\mathcal{F}}$. The regret term is bounded above by

$$8c_{\mathcal{F}}^2 c^2 + 8c_{\mathcal{F}} c \sqrt{2N + c_{\mathcal{F}}^2 c^2},$$

and so its growth rate is $O(N^{1/2})$; this is typical for all popular methods for type III benchmark classes. For comparison, (57) holds with 40 replaced by 2 when $p = 2$ (0U, Theorem 1).

Bounds involving the loss of the competitor in place of $N$, such as (65), have
a clear advantage in situations where some competitors perform very well. Such 
bounds can also be obtained using defensive forecasting (see 01], Theorem 2). 

AAR can also be carried over to Hilbert spaces (with the crucial step made in [21]). It gives a performance bound similar to (57) with $p = 2$, but $40\sqrt{c_{\mathcal{F}}^2 + 1}$ replaced by $2c_{\mathcal{F}}$ (03], Theorem 3). A simple example adapted from JIT shows that the leading constant $2c_{\mathcal{F}}$ cannot be decreased further ([44], Theorem 4).
(In general, attention to the constants is a tradition in learning theory that 
distinguishes it from some parts of the theory of function spaces; probably the 
impetus is coming from experimental machine learning with its common struggle 
for small improvements in the performance of prediction algorithms.) 



7 Very big classes 

In this short section we will see an example of a very fast growth rate of the regret term, barely below the useless rate of $N$. This slow rate is achieved not because of the richness of the function class $\mathcal{F}$ (as in §6 as compared to §5) but because of the richness of the signal space $X$ itself.
The corollary of this section is rather specialized. 

Corollary 9 Suppose $X$ is a totally bounded metric space and $\gamma$ is a positive number that satisfy

$$\mathcal{H}_\epsilon(X) \asymp (1/\epsilon)^\gamma, \quad \epsilon \to 0. \tag{66}$$

Let $\mathcal{F} \subseteq C(X)$ consist of the Hölder continuous functions of order $\beta \in (0, 1]$ with coefficient $c > 0$ that are bounded in absolute value by a given constant. There exists a strategy for Predictor that guarantees, for all $F \in \mathcal{F}$,

N N 
n—1 n—1 
from some $N$ on, where $C_{\beta,\gamma}$ is a constant depending only on $\beta$ and $\gamma$.

Proof The modulus of continuity of $F$ is $\omega(\epsilon) \le c\epsilon^\beta$, and so $\omega^{-1}(\epsilon) \ge (\epsilon/c)^{1/\beta}$. Substituting this in Theorem XXV (more precisely, (233)) of [27], we have

\ogH e (T) = O (W (e/ac) i/* /2 (X)) , 
which in combination with l|66|) gives, for small enough e > 0, 

Let /(e) be the right-hand side of the last inequality. To estimate the infimum 
in |6"|. we find e from 

(taking A/' instead of iV 1 / 2 would not improve e by more than a constant factor), 
which gives 



r \ Ph 



and the upper bound 



1 \ (8/7 



J to \\ogN, 

for the infimum in IjBJ, where is another constant depending only on (3 and 
7. I 

In view of (38) we can take $X$ to be the class of real-valued functions on a parallelepiped in a Euclidean space that are bounded in absolute value by a given constant and whose $k$th partial derivatives exist and are all Hölder continuous of order $\alpha$ with a given coefficient. The signal space $X$ can now be interpreted as the set of images (admittedly, not very good images, without sharp boundaries between different objects).



8 The role of the norm 

Our Theorems in ^cover all values of N, but starting from S^lwe switched 
to stating inequalities that hold from some iV on. This allowed us to simplify 
the statements and to tune our bounds to various parameters of the considered 
benchmark classes. On the negative side, however, some important information 
was lost: for example, the inequality (29) does not involve the norm $\|F\|_{\mathcal{F}}$ of $F$ (whereas (49) retains the information about the norm). The reason is that asymptotically, as $N \to \infty$, the effect of $\|F\|_{\mathcal{F}}$ becomes negligible. This is only true, however, if we fix $F$ while letting $N \to \infty$, and it can be argued that this is not the only interesting asymptotics. For example, in experimental machine learning, $N$ is often a constant (the size of the given data set) and it is the norm $\|F\|_{\mathcal{F}}$ of the contemplated prediction rule $F$ that varies. Another example will be provided by the considerations of the next section, where the norm will be chosen as a function of $N$. In this section we will discuss what happens if all (or all but one) values of $N$ are taken into account. Interestingly, this will significantly change our comparative evaluation of the virtues of some methods.

Finite-dimensional benchmark classes 

Instead of Corollary we now have: 

Corollary 3' Let $\mathcal{F}$ be a finite-dimensional Banach space embedded in $C(X)$ and $L \ge 1$ be a number such that

$$\mathcal{H}_\epsilon(U_{\mathcal{F}}) \le L \log \frac{1}{\epsilon}$$

for all $\epsilon \in (0, 1/2]$. There exists a strategy for Predictor that guarantees, for all $N = 2, 3, \ldots$ and all $F \in \mathcal{F}$,

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + CL\left(\log^+ \|F\|_{\mathcal{F}} + \log N\right), \tag{68}$$

where $C$ is a universal constant.

Proof Substituting $\epsilon := 1/N$ in 0, we obtain:

$$\begin{aligned}
\sum_{n=1}^{N} (y_n - \mu_n)^2
&\le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C\left(L \log(\phi N) + \log\log N + \log\log \phi + 2\right) \\
&\le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C\left(2L \log \phi + 2L \log N + 2\right),
\end{aligned}$$

which gives (68) (for a different $C$). ∎
Instead of Corollary [3] 

Corollary [HI Suppose X is a bounded set in R m and .X^w > 1- There exists 
a strategy for Predictor that guarantees, for all N = 2, 3, . . . and a?Z G R m , 

JV JV 

E (fn " M«) 2 < E (y« - 2; «)) 2 + CX 2 m (log+ \\9\\ 2 + logjV) , (69) 

n—1 n—1 

where C is a universal constant. 
The coefficients in front of $\log N$ in the bounds (22) and (69) are not so
different, m vs. A^m (ignoring the multiplicative constants). The dependence 
on \\9\\ 2 is, however, very different: \\9\\ 2 vs. X 2 m\og + \\0\\ 2 , quadratic in l(2*2*)l 
and logarithmic in (69). The explanation is that AAR uses a Gaussian prior, and so the weights assigned to remote $\theta$ decay very fast, whereas in the method of metric entropy we used slowly decaying weights. The quadratic dependence on $\|\theta\|_2$ is the price that AAR pays for computational efficiency (the former can be improved if AAR for different values of $a$ are AA mixed, as in 02], §8, but the latter might suffer).

We can see that the relation between the AAR bound and the bound ob- 
tained using metric entropy is not as straightforward as it seemed in 21 I n fact, 
the bounds are incomparable: among the advantages of (22) are its explicitness, a better coefficient in front of $\log N$, and the simplicity and efficiency of the underlying prediction strategy; however, the dependence of (69) on the norm of the competitor $\theta$ is better.

Benchmark classes of analytic functions 

In this subsection we will be using the conventions of §5; in particular, $C(X)$ will be the class of continuous complex-valued functions on $X$. Instead of Corollary 5 we have:

Corollary 5' Let $\mathcal{F}$ be a Banach function space compactly embedded in $C(X)$ and $L, M \in [1, \infty)$ be numbers such that

$$\mathcal{H}_\epsilon(U_{\mathcal{F}}) \le L \log^M \frac{1}{\epsilon} \tag{70}$$

for all $\epsilon \in (0, 1/2]$. There exists a strategy for Predictor that guarantees, for all $N = 2, 3, \ldots$ and all $F \in \mathcal{F}$,

$$\sum_{n=1}^{N} |y_n - \mu_n|^2 \le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + C_M L \left(\log^+ \|F\|_{\mathcal{F}} + \log N\right)^M, \tag{71}$$

where $C_M$ is a constant depending only on $M$.

Proof Substituting $\epsilon := 1/N$ in the complex version of (JSJl,

$$\begin{aligned}
\sum_{n=1}^{N} |y_n - \mu_n|^2
&\le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + C\left(L \log^M(\phi N) + \log\log N + \log\log \phi + 2\right) \\
&\le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + C' L \left(\log \phi + \log N\right)^M,
\end{aligned}$$

where $C'$ is another universal constant. ∎
Using Corollary 5' instead of Corollary 5 gives a strategy for Predictor guaranteeing, instead of (33),

$$\sum_{n=1}^{N} |y_n - \mu_n|^2 \le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + C_{G,K} \left(\log^+ \|F\|_{A_G} + \log N\right)^2 \tag{72}$$

for all $F \in A_G$ and all $N = 2, 3, \ldots$, where $C_{G,K}$ is a constant depending on $G$ and $K$ only. Similarly, instead of (36) we have

$$\sum_{n=1}^{N} |y_n - \mu_n|^2 \le \sum_{n=1}^{N} |y_n - F(x_n)|^2 + C_h \left(\log^+ \|F\|_{A_h} + \log N\right)^2 \tag{73}$$

for all $F \in A_h$ and all $N = 2, 3, \ldots$, where $C_h$ is a constant depending on $h$ only. Notice that the asymptotic expressions (32) and (35) per se do not provide any information on the dependence of $C_{G,K}$ on $G$ and $K$ and the dependence of $C_h$ on $h$.

Sobolev-type classes 

Instead of Corollary 8 we now have:

Corollary 8' Let $\mathcal{F}$ be a Banach function space compactly embedded in $C(X)$ and $L \ge 1$, $\gamma > 0$ be numbers satisfying

$$\mathcal{H}_\epsilon(U_{\mathcal{F}}) \le L(1/\epsilon)^\gamma \tag{74}$$

for all $\epsilon \in (0, 1/2]$. There exists a strategy for Predictor that guarantees, for all $N = 1, 2, \ldots$ and all $F \in \mathcal{F}$,

N N 



E (f" ~ ^™) 2 - E ^™ ~~ ^(^n)) 2 
n—l n—l 

/ 1 N \ 

+ C lL->+ 1 4>->+ 1 N->+ 1 +log + log— + loglog^l , (75) 
where $C$ is a universal constant and $\phi$ is defined as before.

Proof See the proof of Corollary 8 (except that we can no longer ignore the log log terms in ©). The only case that remains to be considered is where $N$ is so small that $\epsilon$ in (41) (with $L$ replaced by $L\phi^\gamma$) fails to belong to $(0, 1/2]$. In this case, however, the regret term of (75) exceeds $N/2$ because of the term $\epsilon N$ in ||SJ, and so we can take $C := 2$. ∎

We will refrain from stating the non-asymptotic versions of the inequalities derived in §6 for specific function classes: such versions would be awkward and would add little to our understanding of the dependence of the regret term on the competitor's norm.
9 Super-universal prediction?



In §fJSHHwe dealt with universal prediction in the following, somewhat vague (as 
most of our informal discussion in this section), sense: for a wide (in any case, 
dense in $C(X)$) class $\mathcal{F}$ of continuous prediction rules find a prediction strategy competitive with all $F \in \mathcal{F}$. Possible dense classes $\mathcal{F}$ can be of very different size even when defined on the same domain $X$. Even such meagre (barely infinite-dimensional from the point of view of metric entropy) function classes as $A_G$ and $A_h$ of §5 are dense (and so lead to a universal prediction strategy, in the
sense of Theorem^. The classes of §f)]are much larger. However, we never know 
in advance which class $\mathcal{F}$ will work best for our data sequence $x_1, y_1, x_2, y_2, \ldots$; it would be ideal to have a prediction strategy that works well for many different $\mathcal{F}$ simultaneously. The study of existence of such "super-universal" prediction strategies is a vast understudied (and ill-defined) area, and in this section I will only make several simple and random observations.

Remark There is a cheap way of achieving "super-universality": we can AA mix prediction strategies corresponding to many different classes $\mathcal{F}$. This would, however, further impair computational efficiency and possibly lead to cumbersome performance guarantees (remember that the classes we are interested in, such as $A_h$, $A_G$, $B^s_{p,q}$, often depend on one or more parameters).

Let us say that a function class $\mathcal{F}_1 \subseteq C(X)$ "dominates" a function class $\mathcal{F}_2 \subseteq C(X)$ if any prediction strategy that performs not much worse than the best small-norm prediction rules in $\mathcal{F}_1$ automatically performs not much worse than the best small-norm prediction rules in $\mathcal{F}_2$. (The corresponding definition for the case where $\mathcal{F}_1$ and $\mathcal{F}_2$ are compact classes of functions is simpler: we can ignore the "small-norm" qualification.) We will not try to formalize this "definition" in this paper.

Since we do not have any lower bounds in this paper, when discussing the 
relation of domination we will be comparing the available performance guaran- 
tees rather than the optimal ones. Hopefully, this will be corrected in the future 
work. 

Ideally, there would be one or very few classes T that would dominate nu- 
merous other natural classes. We will see in this section that less massive 
classes often dominate more massive ones (of course, with all the qualifications 
mentioned above). There is no hope for finite-dimensional classes to dominate 
infinite-dimensional ones, and in the three subsections of this section we will 
discuss the relation of domination between type II classes and between type III 
classes, and to what degree type II can dominate type III. 

Domination between some classes of analytic functions 

In this and following subsections, unlike §5, we will consider periodic (period $2\pi$) real-valued functions on $\mathbb{R}$; the function space $A_h$ is now defined as the class of all such functions that can be analytically continued to $\{z \mid |\operatorname{Im} z| < h\}$, with the norm defined to be the supremum norm of the analytic continuation (which is unique).

We will be interested in the quality of competition with prediction rules in $A_h$ achieved by the prediction strategy designed for competing with prediction rules in $A_H$ for $H > h$. But first we prove an auxiliary result.

Lemma 2 Let $0 < h < H < \infty$ and let $F \in A_h$. For small enough $\epsilon > 0$,

$$\log \mathcal{A}_\epsilon^{A_H}(F) \le C \frac{H}{h} \log \frac{1}{\epsilon}, \tag{76}$$

where $C$ is a universal constant.

Proof According to Achieser's theorem (|3S], 5.7.21; [1], §94) for sufficiently large $J$ there is a trigonometric polynomial of degree $J$ at a uniform distance from $F$ at most

where c := \\F\\ A . To make sure that this does not exceed e > (assumed 
sufficiently small), it suffices to set 



1 8c' 
-In — 

h ire 



(77) 



The absolute value of the approximating trigonometric polynomial does not 
exceed + e on the real line and so does not exceed 

(\\F\\ c{ m + t) eJH ( 78 ) 

in the strip |Imz| < H (this follows from the Phragmen-Lindelof theorem: see 
|55] . p. 13, footnote ****). Substituting O int o we find 

logAi"(F) < \og(\\F\\ c(m +e) + (log e)JH < C^log 1 



c 



for small enough e > 0. I 

Combining Lemma 2 with Theorem 0] we obtain the following corollary.

Corollary 10 Let $0 < h < H < \infty$. The strategy for Predictor constructed in
fJ3| for the benchmark class $A_H$ guarantees

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C \frac{H^2}{h^3} \log^2 N \tag{79}$$

for each $F \in A_h$ from some $N$ on, where $C$ is a universal constant.

Proof The regret term in (30) can be bounded above by

$$C' \left. \left( L \log^M \frac{\mathcal{A}^{A_H}_\epsilon(F)}{\epsilon} + \epsilon N \right) \right|_{\epsilon = 1/N}
\le C'' \left. \left( \frac{H^2}{h^3} \log^2 \frac{1}{\epsilon} + \epsilon N \right) \right|_{\epsilon = 1/N}
\le C''' \frac{H^2}{h^3} \log^2 N.$$

In this chain, we set $M := 2$ and $L := 1/h$ (cf. (38)). ∎

The regret term in (79) is not quite as good as the regret term $\frac{1}{h} \log^2 N$
that would be obtained if we used the right value $h$ instead of using $H$
(cf. (38)), but the difference is not great.



From classes of type II to classes of type III

In this subsection we will see how well prediction strategies designed for type II 
classes can cope with type III classes. The difference between the sizes of the 
classes of different types is huge, and the leap might lead to losing half of the 
smoothness of type III classes. 

Lemma 3 Let $h > 0$ and let $F : \mathbb{R} \to \mathbb{R}$ be a non-zero periodic function with
period $2\pi$ whose $k$th derivative ($k \in \{0, 1, \ldots\}$) exists and is Hölder continuous
of order $\alpha \in (0, 1]$ with coefficient $c$. Set $s := k + \alpha$. For small enough $\epsilon > 0$,

$$\log \mathcal{A}^{A_h}_\epsilon(F) \le C h \left( \frac{12c}{\epsilon} \right)^{1/s}, \tag{80}$$

where $C$ is a universal constant.



Proof We will emulate the proof of Lemma 2. According to Jackson's theorem
([29], Theorem 2 in §IV.3) there is a trigonometric polynomial of degree $J$ at a
uniform distance from $F$ at most

$$\frac{12^{k+1} c (1/J)^{\alpha}}{J^k} = 12^{k+1} c J^{-s}.$$

This distance will not exceed $\epsilon > 0$ if we set

$$J := \left\lceil \left( \frac{12^{k+1} c}{\epsilon} \right)^{1/s} \right\rceil. \tag{81}$$

The absolute value of the approximating trigonometric polynomial does not
exceed

$$\left( \|F\|_{C(\mathbb{R})} + \epsilon \right) e^{Jh} \tag{82}$$

(cf. (78)) in the strip $|\operatorname{Im} z| < h$, and so we can substitute (81) into (82) to find

$$\log \mathcal{A}^{A_h}_\epsilon(F) \le \log\left( \|F\|_{C(\mathbb{R})} + \epsilon \right) + (\log e)\, h \left\lceil \left( \frac{12^{k+1} c}{\epsilon} \right)^{1/s} \right\rceil,$$

which for small enough $\epsilon$ gives (80) with any $C > 12 \log e$. ∎

Remark Another way of deriving an estimate for $\mathcal{A}^{A_h}_\epsilon(F)$ for a smooth $F$ would
be to combine Kolmogorov's estimate of the remainder of the Fourier series
with the known results about the size of coefficients in Fourier series (§1.24).
This would, however, produce a weaker result.
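The choice of the degree in the proof of Lemma 3 can be checked numerically. The following is only a minimal sketch under stated assumptions: it takes the Jackson-type bound $12^{k+1} c J^{-s}$ and the growth factor $e^{Jh}$ from the proof, uses base-2 logarithms, and the function names and sample parameters are hypothetical, introduced here for illustration.

```python
import math

def jackson_degree(k, alpha, c, eps):
    """An integer degree J with 12**(k+1) * c * J**(-s) <= eps, where s = k + alpha."""
    s = k + alpha
    J = math.ceil((12 ** (k + 1) * c / eps) ** (1.0 / s))
    # Guard against floating-point rounding in the power above.
    while 12 ** (k + 1) * c * J ** (-s) > eps:
        J += 1
    return J

def log_norm_bound(k, alpha, c, h, sup_norm, eps):
    """Upper bound log2((sup_norm + eps) * e**(J*h)) on the log of the norm in A_h
    of the approximating trigonometric polynomial (base-2 logarithm assumed)."""
    J = jackson_degree(k, alpha, c, eps)
    return math.log2(sup_norm + eps) + math.log2(math.e) * J * h

if __name__ == "__main__":
    # Hypothetical values: k = 1, alpha = 0.5, c = 2, h = 0.5, sup norm 1.
    for eps in (1e-1, 1e-2, 1e-3):
        print(eps, jackson_degree(1, 0.5, 2.0, eps),
              log_norm_bound(1, 0.5, 2.0, 0.5, 1.0, eps))
```

For small `eps` the printed bound should stay below $C h (12c/\epsilon)^{1/s}$ with $C$ slightly above $12 \log e$, in line with the statement of the lemma.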

Combining Lemma 3 with Theorem 0] we now obtain:

Corollary 11 Let $F : \mathbb{R} \to \mathbb{R}$ be a periodic (period $2\pi$) function whose $k$th
derivative ($k \ge 0$) is Hölder continuous of order $\alpha$ with coefficient $c$. The strategy
for Predictor constructed for the class $A_h$ guarantees

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + C h^{\frac{s}{s+2}} c^{\frac{2}{s+2}} N^{\frac{2}{s+2}} \tag{83}$$

from some $N$ on, where $s := k + \alpha$ and $C$ is a universal constant.

Proof The proof is similar to that of Corollary 10. The regret term in (30) can
be bounded above by

$$C'_M \inf_{\epsilon \in (0,1]} \left( L h^M \left( \frac{12c}{\epsilon} \right)^{M/s} + \epsilon N \right)$$

(it is clear that the term $L \log^M \frac{1}{\epsilon}$ can be ignored). Using the upper bound (47)
for this infimum, we obtain

$$2 C'_M L^{\frac{s}{s+M}} h^{\frac{Ms}{s+M}} (12c)^{\frac{M}{s+M}} N^{\frac{M}{s+M}}.$$

Ignoring $12^{M/(s+M)} \in (1, 12)$ and substituting $M := 2$ and $L := 1/h$ (cf. (38)),
we reduce this to (83). ∎

The growth rate $N^{2/(s+2)} = N^{1/(s/2+1)}$ of the regret term in (83) is worse
than the rate $N^{1/(s+1)}$ obtained in Sj^l (see (50)) for a prediction strategy
designed specifically for functions with Hölder continuous derivatives. We can say
that one loses half of the smoothness of $F$ when using the wrong benchmark
class.
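The balancing step used in the proof of Corollary 11 can be illustrated numerically. The sketch below assumes that the upper bound (47) is the elementary estimate $\inf_\epsilon (A \epsilon^{-a} + B \epsilon^{b}) \le 2 A^{b/(a+b)} B^{a/(a+b)}$, attained at the balancing point $\epsilon = (A/B)^{1/(a+b)}$; this reading of (47), the function names and the sample values are assumptions made for illustration only.

```python
def objective(A, a, B, b, eps):
    """The expression A * eps**(-a) + B * eps**b whose infimum is being bounded."""
    return A * eps ** (-a) + B * eps ** b

def balancing_eps(A, a, B, b):
    """The eps that makes the two terms of the objective equal."""
    return (A / B) ** (1.0 / (a + b))

def balanced_bound(A, a, B, b):
    """Assumed form of (47): 2 * A**(b/(a+b)) * B**(a/(a+b))."""
    return 2 * A ** (b / (a + b)) * B ** (a / (a + b))

def regret_exponents(s):
    """Growth exponents of N: (strategy built for A_h, specialized strategy)."""
    return 2 / (s + 2), 1 / (s + 1)

if __name__ == "__main__":
    # Hypothetical instance of the proof: M = 2, L = 1/h, so A = h * (12c)**(2/s).
    h, c, s, N = 0.5, 2.0, 1.5, 10 ** 6
    A, a, B, b = h * (12 * c) ** (2 / s), 2 / s, float(N), 1.0
    eps = balancing_eps(A, a, B, b)
    print(eps, objective(A, a, B, b, eps), balanced_bound(A, a, B, b))
    for smoothness in (0.5, 1, 2, 4):
        print(smoothness, regret_exponents(smoothness))
```

The balancing point lies in $(0, 1]$ only when $A \le B$, which is one way to see why the guarantee is stated "from some $N$ on". The first exponent returned by `regret_exponents(s)` equals the second one evaluated at `s / 2`, which is the "loss of half of the smoothness".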




Domination between Sobolev-type classes 

We first state a trivial corollary of the definition of real interpolation in terms
of the K-method (in the form of the "approximation theorem" in [2], 5.31-5.32).
For the definition of the K-method, see, e.g., [9], §3.1, or [2], 7.8-7.10;
the notation $(X_0, X_1)_{\theta,q}$ below can be understood to be the abbreviation for
$(X_0, X_1)_{\theta,q;K}$. We will be mostly interested in the case $q = \infty$.

Lemma 4 Let $(X_0, X_1)$ be an interpolation pair and $\theta \in (0, 1)$; set
$X := (X_0, X_1)_{\theta,\infty}$. For each $F \in X$ and each $t > 0$ there exists $F_t \in X_1$ such
that

$$\begin{cases}
\|F - F_t\|_{X_0} \le 2 t^{\theta} \|F\|_X, \\
\|F_t\|_{X_1} \le 2 t^{\theta - 1} \|F\|_X.
\end{cases}$$

Proof Since the function

$$K(t, F) := \inf \left\{ \|F_0\|_{X_0} + t \|F_1\|_{X_1} \mid F = F_0 + F_1,\ F_0 \in X_0,\ F_1 \in X_1 \right\}$$

(this is a generalization of 11411) is continuous in $t$ ([9], Lemma 3.1.1), we have,
by the definition of the K-method:

$$\|F\|_X = \sup_{t \in (0,\infty)} t^{-\theta} K(t, F)
= \sup_{t \in (0,\infty)} \inf \left\{ t^{-\theta} \|F_0\|_{X_0} + t^{1-\theta} \|F_1\|_{X_1} \mid F = F_0 + F_1,\ F_0 \in X_0,\ F_1 \in X_1 \right\}.$$

Therefore, for each $t > 0$ there is a split $F = F_0 + F_1$ such that

$$t^{-\theta} \|F_0\|_{X_0} + t^{1-\theta} \|F_1\|_{X_1} \le 2 \|F\|_X,$$

which is stronger than the statement of the lemma. ∎
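By [9], Theorem 6.4.5(1),

$$\left( B^{s_0}_{p,q_0}, B^{s_1}_{p,q_1} \right)_{\theta,r} = B^{(1-\theta) s_0 + \theta s_1}_{p,r}, \tag{86}$$

and applying this to the Hölder-Zygmund spaces $\mathcal{C}^s(\mathbf{X}) := B^s_{\infty,\infty}(\mathbf{X})$ we obtain
the following corollary of Lemma 4.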

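Corollary 12 Let $0 < s < S < \infty$. For each $F \in \mathcal{C}^s(\mathbf{X})$ and each $\epsilon > 0$ there
exists $F_\epsilon \in \mathcal{C}^S(\mathbf{X})$ such that

$$\begin{cases}
\|F - F_\epsilon\|_{C(\mathbf{X})} \le C \epsilon^{s} \|F\|_{\mathcal{C}^s(\mathbf{X})}, \\
\|F_\epsilon\|_{\mathcal{C}^S(\mathbf{X})} \le 2 \epsilon^{s-S} \|F\|_{\mathcal{C}^s(\mathbf{X})},
\end{cases}$$

where $C$ is a universal constant.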

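Proof Setting $\theta := s/S$, we obtain from (86):

$$\left( B^{0}_{\infty,1}, B^{S}_{\infty,\infty} \right)_{s/S,\infty} = B^{s}_{\infty,\infty}.$$

Remember that there is a continuous embedding $B^{0}_{\infty,1}(\mathbf{X}) \hookrightarrow C(\mathbf{X})$ ([18],
(2.3.3/3)). It remains to set $t := \epsilon^S$ in Lemma 4. ∎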

Let us apply the last corollary to the case of the performance bound (63).
That bound (together with its derivation, involving the $\mathcal{C}^s(\mathbf{X})$ norm) gives the
regret term of order, approximately,

$$\|F\|_{\mathcal{C}^s(\mathbf{X})} N^{1 - s/m} \tag{87}$$

for the benchmark class $\mathcal{C}^s(\mathbf{X})$, $0 < s < m/2$, and of order

$$\|F\|_{\mathcal{C}^S(\mathbf{X})} N^{1 - S/m} \tag{88}$$

for the benchmark class $\mathcal{C}^S(\mathbf{X})$, $0 < S < m/2$. Suppose $s < S$ and let us see
when a prediction strategy ensuring regret term (88) for $\mathcal{C}^S(\mathbf{X})$ automatically
ensures regret term (87) for $\mathcal{C}^s(\mathbf{X})$.

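Corollary 12 guarantees that every prediction strategy ensuring regret term
(88) for $F \in \mathcal{C}^S(\mathbf{X})$ ensures regret term

$$\inf_{\epsilon} \left( \|F_\epsilon\|_{\mathcal{C}^S(\mathbf{X})} N^{1 - S/m} + \|F - F_\epsilon\|_{C(\mathbf{X})} N \right)
\le \inf_{\epsilon} \left( 2 \|F\|_{\mathcal{C}^s(\mathbf{X})} N^{1 - S/m} \epsilon^{s-S} + C \|F\|_{\mathcal{C}^s(\mathbf{X})} N \epsilon^{s} \right) \tag{89}$$

for $F \in \mathcal{C}^s(\mathbf{X})$. Using the upper bound (47) for (89), we obtain regret

$$2 \left( 2 \|F\|_{\mathcal{C}^s(\mathbf{X})} N^{1 - S/m} \right)^{s/S} \left( C \|F\|_{\mathcal{C}^s(\mathbf{X})} N \right)^{(S-s)/S},$$

which coincides, to within a constant factor, with (87).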

We can see that the case $s \approx m/2$ in (63) dominates all other cases with
$s < m/2$. The bound for the case $s \approx m/2$ was derived in [44] using
Hilbert-space methods (applicable when $p = 2$). The Banach-space methods developed
in [45] might eventually turn out to be less important (but remember that we
only considered Besov spaces with $p$ and $q$ set to infinity).

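To extend this analysis to the case $s > m/2$, we will have to compare regret
terms of order

$$\|F\|^{\frac{m}{m+s}}_{\mathcal{C}^s(\mathbf{X})} N^{\frac{m}{m+s}} \tag{90}$$

for the benchmark class $\mathcal{C}^s(\mathbf{X})$ and

$$\|F\|^{\frac{m}{m+S}}_{\mathcal{C}^S(\mathbf{X})} N^{\frac{m}{m+S}} \tag{91}$$

for $\mathcal{C}^S(\mathbf{X})$, where $0 < s < S$ (see (52) with $p$ and $q$ set to $\infty$, as in the case of
(50)). Since our comparison is informal anyway, we will ignore the log log terms
in (f?5)l. Corollary 12 and the upper bound (47) imply that every prediction
strategy ensuring regret term (91) for $\mathcal{C}^S(\mathbf{X})$ will also ensure regret term

$$\begin{aligned}
&\inf_{\epsilon} \left( \|F_\epsilon\|^{\frac{m}{m+S}}_{\mathcal{C}^S(\mathbf{X})} N^{\frac{m}{m+S}} + \|F - F_\epsilon\|_{C(\mathbf{X})} N \right) \\
&\quad \le \inf_{\epsilon} \left( 2 \|F\|^{\frac{m}{m+S}}_{\mathcal{C}^s(\mathbf{X})} N^{\frac{m}{m+S}} \epsilon^{\frac{(s-S)m}{m+S}} + C \|F\|_{\mathcal{C}^s(\mathbf{X})} N \epsilon^{s} \right) \\
&\quad \le C' \left( \|F\|^{\frac{m}{m+S}}_{\mathcal{C}^s(\mathbf{X})} N^{\frac{m}{m+S}} \right)^{\frac{s(m+S)}{S(m+s)}} \left( \|F\|_{\mathcal{C}^s(\mathbf{X})} N \right)^{\frac{(S-s)m}{S(m+s)}} \\
&\quad = C' \left( \|F\|_{\mathcal{C}^s(\mathbf{X})} N \right)^{\frac{m}{m+s}}
\end{aligned} \tag{92}$$

for $\mathcal{C}^s(\mathbf{X})$. The regret rate obtained is as good as (90), to within a constant
factor. Therefore, as far as our bounds are concerned, $\mathcal{C}^S(\mathbf{X})$ dominates $\mathcal{C}^s(\mathbf{X})$.
Unfortunately, these bounds are known to be loose (see ^Jl), at least in the case
of low smoothness, and it remains to be seen whether the domination still holds
for tighter bounds.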
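The exponent arithmetic behind the two comparisons in this subsection can be verified exactly. The sketch below assumes the balancing rule $\inf_\epsilon (A \epsilon^{-a} + B \epsilon^{b}) \lesssim A^{b/(a+b)} B^{a/(a+b)}$ as the content of (47), and rates of the form $N^{1-S/m}$ (low smoothness) and $N^{m/(m+S)}$ (high smoothness); the function names and the sample values of $s$, $S$, $m$ are hypothetical.

```python
from fractions import Fraction

def low_smoothness_exponent(s, S, m):
    """Exponent of N after balancing N**(1 - S/m) * eps**(s - S) against N * eps**s."""
    s, S, m = Fraction(s), Fraction(S), Fraction(m)
    a, b = S - s, s                      # powers of 1/eps and eps in the two terms
    return (1 - S / m) * b / (a + b) + a / (a + b)

def high_smoothness_exponent(s, S, m):
    """Exponent of N after balancing N**p * eps**((s - S) * p) against N * eps**s,
    where p = m / (m + S)."""
    s, S, m = Fraction(s), Fraction(S), Fraction(m)
    p = m / (m + S)
    a, b = (S - s) * p, s
    return p * b / (a + b) + a / (a + b)

if __name__ == "__main__":
    # Low smoothness: the result should equal 1 - s/m.
    print(low_smoothness_exponent(1, 2, 5), 1 - Fraction(1, 5))
    # High smoothness: the result should equal m / (m + s).
    print(high_smoothness_exponent(3, 4, 2), Fraction(2, 2 + 3))
```

Exact rational arithmetic is used so that the identities are checked without rounding error.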






10 Conclusion 



In this paper we have seen the following typical rates of growth of the regret 
term: 

(I) for finite-dimensional $\mathcal{F}$ (type I of [27], §3),

$$O(\log N);$$

(II) for classes $\mathcal{F}$ of analytic functions of $m$ variables (type II of [27]),

$$O\left( \log^{m+1} N \right);$$

(III) for classes $\mathcal{F}$ of functions of $m$ variables with smoothness indicator $s$ (type
III of [27]),

(IV) for classes $\mathcal{F}$ of Lipschitzian functionals on classes of the previous type
(such $\mathcal{F}$ are representative of type IV of [27]), a typical rate is

$$O\left( N / \log^{s/m} N \right).$$

Rates of types I and III have been known in competitive on-line prediction, 
whereas types II and IV appear new. For the first time we can see enough 
fragments to get an impression of the big picture. These are still small fragments 
and the picture is still vague. The method of metric entropy, despite its wide 
applicability, is not universal and often does not give optimal results. My goal 
was to convince my listeners or readers that the arising questions are interesting 
ones. 

We have also considered, in a very tentative way, the question of how much
one has to pay for using a wrong benchmark class (§9). From the available
very preliminary results it appears that using meagre (albeit dense in $C(\mathbf{X})$)
benchmark classes is safer than using rich classes.

These are possible directions of theoretical research: 

• Find computationally efficient prediction strategies for benchmark classes 
such as Aq and $A_h$ (type II) and Besov spaces with $m/(m + s) < 1/2$ (in
the notation of (52)).

• Find uniform in N estimates of metric entropy (for applications such as 
those in §SEJ>. 

• Extend this paper's results to discontinuous prediction rules (for estimates
of metric entropy in this case see, e.g., [14]).

• Perhaps most importantly, complement performance guarantees such as
those in this paper with lower bounds. A lower bound corresponding
to Proposition with $p = 2$ is proved in [21], Theorem 4; however, the
function space $\mathcal{F}$ constructed there is not compactly embedded in $C(\mathbf{X})$,
and so not interesting from the point of view of metric entropy.

In experimental research, it would be interesting to find out the "empirical
approachability function"

$$\mathcal{A}^{\mathcal{F},2}_\epsilon(x_1, y_1, \ldots, x_N, y_N) := \inf \left\{ \|F\|_{\mathcal{F}} \;\middle|\; \frac{1}{N} \sum_{n=1}^{N} (y_n - F(x_n))^2 \le \epsilon \right\} \tag{93}$$

(cf. 1I2JI; the upper index 2 refers to using the quadratic loss function in this
definition) for standard benchmark data sets

$$(x_1, y_1, \ldots, x_N, y_N) := ((x_1, y_1), \ldots, (x_N, y_N)) \tag{94}$$

and standard function classes $\mathcal{F}$. It is clear that (93) will be finite for all $\epsilon > 0$
if $\mathcal{F}$ is dense in $C(\mathbf{X})$ and

$$x_{n_1} = x_{n_2} \implies y_{n_1} = y_{n_2}.$$

If a prediction strategy guarantees a regret term of $f(\|F\|_{\mathcal{F}}, N)$ (we will assume
that $f$ is a continuous function in its first argument), in the sense that

$$\sum_{n=1}^{N} (y_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - F(x_n))^2 + f(\|F\|_{\mathcal{F}}, N)$$

for all $F \in \mathcal{F}$ and all $N = 1, 2, \ldots$, the loss of this prediction strategy on the
data set (94) will be at most

$$\inf_{\epsilon > 0} \left( f\left( \mathcal{A}^{\mathcal{F},2}_\epsilon(x_1, y_1, \ldots, x_N, y_N), N \right) + \epsilon N \right).$$

Knowing typical empirical approachability functions (93) for various function
classes might suggest function classes most promising for various practical
problems.
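To make the notion concrete, here is a minimal sketch for a deliberately tiny toy class: the one-dimensional linear rules $F(x) = wx$ with norm $|w|$. It assumes that the empirical approachability function is the smallest norm of a rule whose mean squared error on the data is at most $\epsilon$; the toy class (which is not dense in the continuous functions), the function names and the regret function passed to `loss_bound` are all hypothetical and serve only as an illustration.

```python
import math

def empirical_approachability(xs, ys, eps):
    """Smallest |w| with mean((y - w*x)**2) <= eps, or math.inf if no w qualifies."""
    n = len(xs)
    sxx = sum(x * x for x in xs) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    syy = sum(y * y for y in ys) / n
    if syy <= eps:                      # the zero rule already fits within eps
        return 0.0
    if sxx == 0.0:                      # all inputs are zero: only the zero rule matters
        return math.inf
    # The mean squared error is sxx*w**2 - 2*sxy*w + syy; solve "<= eps" for w.
    disc = sxy * sxy - sxx * (syy - eps)
    if disc < 0.0:                      # even least squares misses the target eps
        return math.inf
    root = math.sqrt(disc)
    return min(abs((sxy - root) / sxx), abs((sxy + root) / sxx))

def loss_bound(xs, ys, regret, eps_grid):
    """Minimum over a grid of eps of regret(A_eps, N) + eps * N.

    Every eps gives a valid bound, so the grid minimum is still an upper bound
    on the loss, although it may exceed the infimum over all eps > 0."""
    n = len(xs)
    best = math.inf
    for eps in eps_grid:
        a = empirical_approachability(xs, ys, eps)
        if math.isfinite(a):
            best = min(best, regret(a, n) + eps * n)
    return best

if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [0.9, 2.1, 2.9, 4.2]
    grid = [2.0 ** (-j) for j in range(12)]
    # Hypothetical regret function, chosen only to exercise the code.
    print(loss_bound(xs, ys, lambda a, n: a * a * math.log(n + 1), grid))
```

For richer classes (for example, reproducing kernel Hilbert spaces) the same quantity would require solving a constrained minimum-norm problem rather than a quadratic inequality.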

A natural next step would be to compare different benchmark classes on 
real-world data sets. This is a task for experimental machine learning; what 
learning theory can do is to study the relation of domination between various 
a priori plausible benchmark classes: e.g., some of them may turn out to be 
useless or nearly useless on purely theoretical grounds. 



Acknowledgments 

This paper was written to support my talk at the workshop "Metric entropy and 
applications in analysis, learning theory and probability" (Edinburgh, Scotland, 
September 2006). I am grateful to its organizers, Thomas Kühn, Fernando
Cobos and W. D. Evans, for inviting me. This version of the paper is preliminary 
and is likely to be revised as a result of discussions at the workshop. 

Nicolo Cesa-Bianchi, Gabor Lugosi, Steven Smale and Alex Smola supplied 
the principal components of this paper with their incisive questions and com- 
ments. Ilia Nouretdinov's help was invaluable. This work was partially sup- 
ported by MRC (grant S505/65). 






References 



[1] Naum I. Achieser. Theory of Approximation. Ungar, New York, 1956. 

[2] Robert A. Adams and John J. F. Fournier. Sobolev Spaces, volume 140
of Pure and Applied Mathematics. Academic Press, Amsterdam, second 
edition, 2003. 

[3] Lars V. Ahlfors. Complex Analysis. International Series in Pure and
Applied Mathematics. McGraw-Hill, New York, third edition, 1979.

[4] Nachman Aronszajn. La théorie générale des noyaux reproduisants et ses
applications, première partie. Proceedings of the Cambridge Philosophical
Society, 39:133-153 (additional note: p. 205), 1943. The second part of this
paper is [5].

[5] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the 
American Mathematical Society, 68:337-404, 1950. 

[6] Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self- 
confident on-line learning algorithms. Journal of Computer and System 
Sciences, 64:48-75, 2002. 

[7] V. Bargmann. On a Hilbert space of analytic functions and an associated 
integral transform, part 1. Communications on Pure and Applied Mathe- 
matics, 14:187-214, 1961. 

[8] Nina K. Bary. A Treatise on Trigonometric Series. Macmillan, New York, 
1964. In two volumes. Ralph P. Boas, Jr., is very critical of the English 
translation in his review in Mathematical Reviews. Russian edition: Bari, 
Nina K. Trigonometricheskie ryady. Fizmatlit, Moscow, 1961. 

[9] Jöran Bergh and Jörgen Löfström. Interpolation Spaces: An Introduction,
volume 223 of Die Grundlehren der Mathematischen Wissenschaften.
Springer, Berlin, 1976.

[10] Bernd Carl and Irmtraud Stephani. Entropy, Compactness and the Ap- 
proximation of Operators, volume 98 of Cambridge Tracts in Mathematics. 
Cambridge University Press, Cambridge, England, 1990. 

[11] Nicolo Cesa-Bianchi, Philip M. Long, and Manfred K. Warmuth. Worst- 
case quadratic loss bounds for on-line prediction of linear functions by 
gradient descent. IEEE Transactions on Neural Networks, 7:604-619, 1996. 

[12] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. 
Cambridge University Press, Cambridge, England, 2006. 

[13] James A. Clarkson. Uniformly convex spaces. Transactions of the American 
Mathematical Society, 40:396-414, 1936. 






[14] G. F. Clements. Entropies of sets of functions of bounded variation. Cana- 
dian Journal of Mathematics, 15:422-432, 1963. 

[15] Fernando Cobos and David E. Edmunds. Clarkson's inequalities, Besov
spaces and Triebel-Sobolev spaces. Zeitschrift für Analysis und ihre
Anwendungen, 7:229-232, 1988.

[16] Felipe Cucker and Steve Smale. On the mathematical foundations of learn- 
ing. Bulletin (New Series) of the American Mathematical Society, 39:1-49, 
2002. 

[17] Richard M. Dudley. Real Analysis and Probability, volume 74 of Cambridge 
Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 
England, revised edition, 2002. 

[18] David E. Edmunds and Hans Triebel. Function Spaces, Entropy Numbers, 
Differential Operators, volume 120 of Cambridge Tracts in Mathematics. 
Cambridge University Press, Cambridge, England, 1996. 

[19] Ryszard Engelking. General Topology, volume 6 of Sigma Series in Pure
Mathematics. Heldermann, Berlin, second edition, 1989. 

[20] Dean P. Foster. Prediction in the worst case. Annals of Statistics, 19:1084- 
1090, 1991. 

[21] Alex Gammerman, Yuri Kalnishkan, and Vladimir Vovk. On-line predic- 
tion with kernels and the Complexity Approximation Principle. In Max 
Chickering and Joseph Halpern, editors, Proceedings of the Twentieth An- 
nual Conference on Uncertainty in Artificial Intelligence, pages 170-176, 
Arlington, VA, 2004. AUAI Press. 

[22] Olof Hanner. On the uniform convexity of $L^p$ and $l^p$. Arkiv för Matematik,
3:239-244, 1956.

[23] Yuri Kalnishkan and Michael V. Vyugin. The Weak Aggregating Algorithm 
and weak mixability. In Peter Auer and Ron Meir, editors, Proceedings of
the Eighteenth Annual Conference on Learning Theory, volume 3559 of 
Lecture Notes in Computer Science, pages 188-203, Berlin, 2005. Springer. 

[24] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated Gradient ver- 
sus Gradient Descent for linear predictors. Information and Computation, 
132:1-63, 1997. 

[25] Jyrki Kivinen and Manfred K. Warmuth. Averaging expert predictions. In 
Paul Fischer and Hans U. Simon, editors, Proceedings of the Fourth Euro- 
pean Conference on Computational Learning Theory, volume 1572 of Lec- 
ture Notes in Artificial Intelligence, pages 153-167, Berlin, 1999. Springer. 

[26] Andrei N. Kolmogorov. Zur Grössenordnung des Restgliedes Fourierschen
Reihen differenzierbarer Funktionen. Annals of Mathematics, 36:521-526,
1935.






[27] Andrei N. Kolmogorov and Vladimir M. Tikhomirov. ε-entropy and
ε-capacity of sets in functional spaces (in Russian). Uspekhi Matematicheskikh
Nauk, 14(2):3-86, 1959.

[28] Joram Lindenstrauss and Lior Tzafriri. Classical Banach Spaces II: Func- 
tion Spaces, volume 97 of Ergebnisse der Mathematik und ihrer Grenzgebi- 
ete. Springer, Berlin, 1979. 

[29] Isidor P. Natanson. Constructive Function Theory, volume 1: Uniform 
Approximation. Ungar, New York, 1964. 

[30] Vern I. Paulsen. An introduction to the theory of reproducing kernel Hilbert 
spaces. Course notes, available from the author's web page (accessed in 
August 2006), February 2006. 

[31] Lev S. Pontryagin and Lev G. Shnirel'man. Sur une propriété métrique de
la dimension. Annals of Mathematics (New Series), 33:156-162, 1932.

[32] Walter Rudin. Real and Complex Analysis. International Series in Pure 
and Applied Mathematics. McGraw-Hill, New York, third edition, 1987. 

[33] Saburou Saitoh. Integral Transforms, Reproducing Kernels and their Appli- 
cations, volume 369 of Pitman Research Notes in Mathematics. Longman, 
Harlow, England, 1997. 

[34] Craig Saunders, Mark O. Stitson, Jason Weston, Léon Bottou, Bernhard
Schölkopf, and Alexander J. Smola. Support vector machine reference
manual. Technical Report CSD-TR-98-03, Department of Computer Science,
Royal Holloway, University of London, 1998.

[35] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT
Press, Cambridge, MA, 2002. 

[36] Ingo Steinwart. On the influence of the kernel on the consistency of support 
vector machines. Journal of Machine Learning Research, 2:67-93, 2001. 

[37] Ingo Steinwart, Don Hush, and Clint Scovel. An explicit description of 
the reproducing kernel Hilbert spaces of Gaussian RBF kernels. Technical 
Report LA-UR 04-8274, Los Alamos National Laboratory, 2004. 

[38] Aleksandr F. Timan. Theory of Approximation of Functions of a Real 
Variable. Pergamon Press, Oxford, 1963. 

[39] Hans Triebel. Theory of Function Spaces II, volume 84 of Monographs in
Mathematics. Birkhäuser, Basel, 1992.

[40] Vladimir N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. 

[41] Anatoly G. Vitushkin. Otsenka slozhnosti zadachi tabulirovaniya. Fizmat- 
lit, Moscow, 1959. English translation: Theory of the Transmission and 
Processing of Information, Pergamon Press, Oxford, 1961. 






[42] Vladimir Vovk. Competitive on-line statistics. International Statistical 
Review, 69:213-248, 2001. 

[43] Vladimir Vovk. Non-asymptotic calibration and resolution. Technical
Report arXiv:cs.LG/0506004 (version 3), arXiv.org e-Print archive, August
2005.

[44] Vladimir Vovk. On-line regression competitive with reproducing kernel
Hilbert spaces. Technical Report arXiv:cs.LG/0511058 (version 2),
arXiv.org e-Print archive, January 2006.

[45] Vladimir Vovk. Competing with wild prediction rules. Technical Report 
arXiv:cs.LG/0512059 (version 2), arXiv.org e-Print archive, January 
2006. 

[46] Vladimir Vovk. Predictions as statements and decisions. Technical Report
arXiv:cs.LG/0606093, arXiv.org e-Print archive, June 2006.

[47] Vladimir Vovk. Competing with stationary prediction strategies. Technical 
Report arXiv:cs.LG/0607067, arXiv.org e-Print archive, July 2006. 

[48] Vladimir Vovk. Competing with Markov prediction strategies. Technical
Report arXiv:cs.LG/0607136, arXiv.org e-Print archive, July 2006.

[49] Vladimir Vovk. Leading strategies in competitive on-line prediction.
Technical Report arXiv:cs.LG/0607134, arXiv.org e-Print archive, July 2006.

[50] Grace Wahba. Spline Models for Observational Data, volume 59 of CBMS- 
NSF Regional Conference Series in Applied Mathematics. SIAM, Philadel- 
phia, PA, 1990. 

[51] Harold Widom. Rational approximation and n-dimensional diameter. Jour- 
nal of Approximation Theory, 5:343-361, 1972. 






